Abstract


Software Fault Isolation (SFI) is an effective approach 
to sandboxing binary code of questionable provenance, 
an interesting use case being native plugins in a Web
browser. We present software fault isolation schemes for 
ARM and x86-64 that provide control-flow and memory 
integrity with average performance overhead of under 
5% on ARM and 7% on x86-64. We believe these are the 
best known SFI implementations for these architectures, 
with significantly lower overhead than previous systems 
for similar architectures. Our experience suggests that 
these SFI implementations benefit from instruction-level 
parallelism, and have particularly small impact for work- 
loads that are data memory-bound, both properties that 
tend to reduce the impact of our SFI systems for future 
CPU implementations. 


1 Introduction 


As an application platform, the modern web browser has 
some noteworthy strengths in such areas as portability 
and access to Internet resources. It also has a number 
of significant handicaps. One such handicap is compu- 
tational performance. Previous work [30] demonstrated 
how software fault isolation (SFI) can be used in a sys- 
tem to address this gap for Intel 80386-compatible sys- 
tems, with a modest performance penalty and without 
compromising the safety users expect from Web-based 
applications. A major limitation of that work was its 
specificity to the x86, and in particular its reliance on x86 
segmented memory for constraining memory references. 
This paper describes and evaluates analogous designs for 
two more recent instruction set implementations, ARM 
and 64-bit x86, with pure software-fault isolation (SFI) 
assuming the role of segmented memory. 
The main contributions of this paper are as follows: 
• A design for ARM SFI that provides control flow
  and store sandboxing with less than 5% average
  overhead,
• A design for x86-64 SFI that provides control flow
  and store sandboxing with less than 7% average
  overhead, and
• A quantitative analysis of these two approaches on
  modern CPU implementations.
We will demonstrate that the overhead of fault isolation
using these techniques is very low, helping to make SFI 
a viable approach for isolating performance critical, un- 
trusted code in a web application. 


1.1 Background 


This work extends Google Native Client [30].¹ Our
original system provides efficient sandboxing of x86-32 
browser plugins through a combination of SFI and mem- 
ory segmentation. We assume an execution model where 
untrusted (hence sandboxed) code is multi-threaded, and 
where a trusted runtime supporting OS portability and se- 
curity features shares a process with the untrusted plugin 
module. 

The original NaCl x86-32 system relies on a set of 
rules for code generation that we briefly summarize here: 

• The code section is read-only and statically linked.
• The code section is conceptually divided into fixed
  sized bundles of 32 bytes.
• All valid instructions are reachable by a disassembly
  starting at a bundle beginning.
• All indirect control flow instructions are replaced
  by a multiple-instruction sequence (pseudo-instruction)
  that ensures target address alignment to
  a bundle boundary.
• No instruction or pseudo-instruction in the binary
  crosses a bundle boundary.


All rules are checked by a verifier before a program is 
executed. This verifier together with the runtime system 
comprise NaCl’s trusted code base (TCB).
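
The bundle rules above lend themselves to a simple mechanical check. The following is a minimal sketch, not the NaCl verifier itself: the function names and the list of (offset, length) pairs are hypothetical stand-ins for the output of a real disassembler.

```python
BUNDLE_SIZE = 32  # bytes, per the x86-32 rules listed above


def crosses_bundle(offset: int, length: int, bundle_size: int = BUNDLE_SIZE) -> bool:
    """True if an instruction occupying [offset, offset + length) spans two bundles."""
    first_bundle = offset // bundle_size
    last_bundle = (offset + length - 1) // bundle_size
    return first_bundle != last_bundle


def is_bundle_aligned(target: int, bundle_size: int = BUNDLE_SIZE) -> bool:
    """True if an indirect control flow target sits on a bundle boundary."""
    return target % bundle_size == 0


# Hypothetical (offset, length) pairs standing in for a disassembly.
instructions = [(0, 5), (5, 3), (30, 4)]
violations = [ins for ins in instructions if crosses_bundle(*ins)]
print(violations)  # [(30, 4)]
```

A real verifier would also have to decode the instructions and recognize pseudo-instructions; this sketch covers only the boundary arithmetic.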

For complete details on the x86-32 system please refer 
to our earlier paper [30]. That work reported an average 
overhead of about 5% for control flow sandboxing, with 
the bulk of the overhead being due to alignment consid- 
erations. The system benefits from segmented memory 
to avoid additional sandboxing overhead. 

Initially we were skeptical about SFI as a replace- 
ment for hardware memory segments. This was based 
in part on running code from previous research [19], in- 
dicating about 25% overhead for x86-32 control+store 
SFI, which we considered excessive. As we continued


¹We abbreviate Native Client as “NaCl” when used as an adjective.
our exploration of ARM SFI and sought to understand
ARM behavior relative to x86 behavior, we could not
adequately explain the observed performance gap between
ARM SFI, at under 10% overhead, and the overhead on
x86-32 in terms of instruction set differences. With further
study we understood that the prior implementations
for x86-32 may have suffered from suboptimal instruction
selection and overly pessimistic alignment.


Reliable disassembly of x86 machine code figured 
largely into the motivation of our previous sandbox de- 
sign [30]. While the challenges for x86-64 are substan- 
tially similar, it may be less clear why analogous rules 
and validation are required for ARM, given the relative 
simplicity of the ARM instruction encoding scheme, so 
we review a few relevant considerations here. Modern 
ARM implementations commonly support 16-bit Thumb 
instruction encodings in addition to 32-bit ARM instruc- 
tions, introducing the possibility of overlapping instruc- 
tions. Also, ARM binaries commonly include a number 
of features that must be considered or eliminated by our 
sandbox implementation. For example, ARM binaries 
commonly include read-only data embedded in the text 
segment. Such data in executable memory regions must 
be isolated to ensure it cannot be used to invoke system 
call instructions or other instructions incompatible with 
our sandboxing scheme. 


Our architecture further requires the coexistence of 
trusted and untrusted code and data in the same pro- 
cess, for efficient interaction with the trusted runtime that 
provides communications and portable interaction with 
the native operating system and the web browser. As 
such, indirect control flow and memory references must 
be constrained to within the untrusted memory region, 
achieved through sandboxing instructions. 


We briefly considered using page protection as an al- 
ternative to memory segments [26]. In such an ap- 
proach, page-table protection would be used to prevent 
the untrusted code from manipulating trusted data; SFI is 
still required to enforce control-flow restrictions. Hence, 
page-table protection can only avoid the overhead of data 
SFI; the control-flow SFI overhead persists. Also, further 
use of page protection adds an additional OS-based pro- 
tection mechanism into the system, in conflict with our 
requirement of portability across operating systems. This 
OS interaction is complicated by the requirement for 
multiple threads that transition independently between 
untrusted (sandboxed) and trusted (not sandboxed) ex- 
ecution. Due to the anticipated complexity and over- 
head of this OS interaction and the small potential per- 
formance benefit we opted against page-based protection 
without attempting an implementation. 
2 System Architecture


The high-level strategy for our ARM and x86-64 sand- 
boxes builds on the original Native Client sandbox for 
x86-32 [30], which we will call NaCl-ARM, NaCl-x86- 
64, and NaCl-x86-32 respectively. The three approaches 
are compared in Table 1. Both NaCl-ARM and NaCl- 
x86-64 sandboxes use alignment masks on control flow 
target addresses, similar to the prior NaCl-x86-32 sys- 
tem. Unlike the prior system, our new designs mask 
high-order address bits to limit control flow targets to a 
logical zero-based virtual address range. For data ref- 
erences, stores are sandboxed on both systems. Note 
that reads of secret data are generally not an issue as the 
address space barrier between the NaCl module and the 
browser protects browser resources such as cookies. 

In the absence of segment protection, our ARM and 
x86-64 systems must sandbox store instructions to prevent
modification of trusted data, such as code addresses
on the trusted stack. Although the layout of the address 
space differs between the two systems, both use a combi- 
nation of masking and guard pages to keep stores within 
the valid address range for untrusted data. To enable 
faster memory accesses through the stack pointer, both 
systems maintain the invariant that the stack pointer al- 
ways holds a valid address, using guard pages at each 
end to catch escapes due to both overflow/underflow and 
displacement addressing. 

Finally, to encourage source-code portability between 
the systems, both the ARM and the x86-64 systems use 
ILP32 (32-bit Int, Long, Pointer) primitive data types, as 
does the previous x86-32 system. While this limits the 
64-bit system to a 4GB address space, it can also improve
performance on x86-64 systems, as discussed in section
4.2.

At the level of instruction sequences and address space 
layout, the ARM and x86-64 data sandboxing solutions 
are very different. The ARM sandbox leverages instruc- 
tion predication and some peculiar instructions that allow 
for compact sandboxing sequences. In our x86-64 sys- 
tem we leverage the very large address space to ensure 
that most x86 addressing modes are allowed. 


3 Implementation 


3.1 ARM 


The ARM takes many characteristics from RISC micro- 
processor design. It is built around a load/store archi- 
tecture, 32-bit instructions, 16 general purpose registers, 
and a tendency to avoid multi-cycle instructions. It devi- 
ates from the simplest RISC designs in several ways: 
• condition codes that can be used to predicate most
  instructions
• “Thumb-mode” 16-bit instruction extensions can
  improve code density
Feature                          NaCl-x86-32            NaCl-ARM                NaCl-x86-64
Addressable memory               1GB                    1GB                     4GB
Virtual base address             Any                    0                       44GB
Data model                       ILP32                  ILP32                   ILP32
Reserved registers               0 of 8                 0 of 15                 1 of 16
Data address mask method         None                   Explicit instruction    Implicit in result width
Control address mask method      Explicit instruction   Explicit instruction    Explicit instruction
Bundle size (bytes)              32                     16                      32
Data embedded in text segment    Forbidden              Permitted               Forbidden
“Safe” addressing registers      All                    sp                      rsp, rbp
Effect of out-of-sandbox store   Trap                   No effect (typically)   Wraps mod 4GB
Effect of out-of-sandbox jump    Trap                   Wraps mod 1GB           Wraps mod 4GB

Table 1: Feature Comparison of Native Client SFI schemes. NB: the current release of the Native Client system has changed since
the first report [30] was written, where the addressable memory size was 256MB. Other parameters are unchanged.



• relatively complex barrel shifter and addressing
  modes


While the predication and shift capabilities directly ben- 
efit our SFI implementation, we restrict programs to the 
32-bit ARM instruction set, with no support for variable- 
length Thumb and Thumb-2 encodings. While Thumb 
encodings can incrementally reduce text size, most important
on embedded and handheld devices, our work targets
more powerful devices like notebooks, where memory
footprint is less of an issue, and where the negative
performance impact of Thumb encodings is a concern. 
We confirmed our choice to omit Thumb encodings with 
a number of major ARM processor vendors. 

Our sandbox restricts untrusted stores and control flow 
to the lowest 1GB of the process virtual address space, 
reserving the upper 3GB for our trusted runtime and the 
operating system. As on x86-64, we do not prevent un- 
trusted code from reading outside its sandbox. Isolating 
faults in ARM code thus requires: 

• Ensuring that untrusted code cannot execute any
  forbidden instructions (e.g. undefined encodings,
  raw system calls).
• Ensuring that untrusted code cannot store to memory
  locations above 1GB.
• Ensuring that untrusted code cannot jump to memory
  locations above 1GB (e.g. into the service runtime
  implementation).


We achieve these goals by adapting to ARM the ap- 
proach described by Wahbe et al. [28]. We make three 
significant changes, which we summarize here before re- 
viewing the full design in the rest of this section. First, 
we reserve no registers for holding sandboxed addresses, 
instead requiring that they be computed or checked in 
a single instruction. Second, we ensure the integrity of 
multi-instruction sandboxing pseudo-instructions with a 
variation of the approach used by our earlier x86-32 system
[30], adapted to further prevent execution of embedded
data. Finally, we leverage the ARM’s fully predicated
instruction set to introduce an alternative data address
sandboxing sequence. This alternative sequence
replaces a data dependency with a control dependency,
preventing pipeline stalls and yielding lower overhead
on multiple-issue and out-of-order microarchitectures.


3.1.1 Code Layout and Validation 


On ARM, as on x86-32, untrusted program text is sepa- 
rated into fixed-length bundles, currently 16 bytes each, 
or four machine instructions. All indirect control flow 
must target the beginning of a bundle, enforced at run- 
time with address masks detailed below. Unlike on the 
x86-32, we do not need bundles to prevent overlapping
instructions, which are impossible in ARM’s 32-bit in- 
struction encoding. They are necessary to prevent indi- 
rect control flow from targeting the interior of pseudo- 
instruction and bundle-aligned “trampoline” sequences. 
The bundle structure also allows us to support data em- 
bedded in the text segment, with data bundles starting 
with an invalid instruction (currently bkpt 0x7777) 
to prevent execution as code. 

The validator uses a fall-through disassembly of the 
text to identify valid instructions, noting the interior of 
pseudo-instructions and data bundles are not valid con- 
trol flow targets. When it encounters a direct branch, 
it further confirms that the branch target is a valid in- 
struction. For indirect control flow, many ARM opcodes 
can cause a branch by writing r15, the program counter. 
We forbid most of these instructions² and consider only
explicit branch-to-address-in-register forms such as bx
r0 and their conditional equivalents. This restriction is
consistent with recent guidance from ARM for compiler 


²We do permit the instruction bic r15, rN, MASK. Although it
allows a single-instruction sandboxed control transfer, it can have poor
branch prediction performance.
writers. Any such branch must be immediately preceded
by an instruction that masks the destination register. The 
mask must clear the most significant two bits, restricting 
branches to the low 1GB, and the four least significant 
bits, restricting targets to bundle boundaries. In 32-bit 
ARM, the Bit Clear (bic) instruction can clear up to 
eight bits rotated to any even bit position. For example, 
this pseudo-instruction implements a sandboxed branch 
through r0 in eight bytes total, versus the four bytes re- 
quired for an unsandboxed branch: 


bic r0, r0, #0xc000000f
bx r0
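
To make the effect of the branch mask concrete, the sketch below models the `bic` step in Python and checks that the mask fits ARM's rotated 8-bit immediate form. It is an illustration only: the helper names are ours, and the mask value 0xc000000f is derived from the description above (top two bits plus low four bits).

```python
MASK32 = 0xFFFFFFFF
BRANCH_MASK = 0xC000000F  # top two bits (1GB limit) plus low four bits (16-byte bundles)


def bic(value: int, mask: int) -> int:
    """Model ARM `bic rd, rn, #mask`: clear every bit of value that is set in mask."""
    return value & ~mask & MASK32


def arm_immediate_encodable(imm: int) -> bool:
    """True if imm is an 8-bit value rotated right by an even amount (0, 2, ..., 30)."""
    for rot in range(0, 32, 2):
        # Rotating left by rot undoes a rotate-right by rot.
        undone = ((imm << rot) | (imm >> (32 - rot))) & MASK32
        if undone < 0x100:
            return True
    return False


target = bic(0xDEADBEEF, BRANCH_MASK)
print(hex(target))  # 0x1eadbee0
```

Whatever value the untrusted code places in the register, the masked result lies below 1GB and on a 16-byte boundary, which is the property the validator relies on.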


As we do not trust the contents of memory, the com- 
mon ARM return idiom pop {pc} cannot be used. In- 
stead, the return address must be popped into a register 
and masked: 


pop { lr }
bic lr, lr, #0xc000000f
bx lr


Branching through LR (the link register) is still recognized
by the hardware as a return, so we benefit from
hardware return stack prediction. Note that these se- 
quences introduce a data dependency between the bx 
branch instruction and its adjacent masking instruction. 
This pattern (generating an address via the ALU and im- 
mediately jumping to it) is sufficiently common in ARM 
code that the modern ARM implementations [3] can dis- 
patch the sequence without stalling. 

For stores, we check that the address is confined to the 
low 1GB, with no alignment requirement. Rather than 
destructively masking the address, as we do for control 
flow, we use a tst instruction to verify that the most
significant bits are clear together with a predicated store:³


tst r0, #0xc0000000
streq r1, [r0, #12]


Like bic, tst uses an eight-bit immediate rotated 
to any even position, so the encoding of the mask is 
efficient. Using tst rather than bic here avoids a 
data dependency between the guard instruction and the 
store, eliminating a two-cycle address-generation stall 
on Cortex-A8 that would otherwise triple the cost of 
the added instruction. This illustrates the usefulness of 
the ARM architecture’s fully predicated instruction set. 
Some predicated SFI stores can also be synthesized in 
this manner, using sequences such as tsteq/streq. 
For cases where the compiler has selected a predicated 
store that cannot be synthesized with tst, we revert 
to a bic-based sandbox, with the consequent address- 
generation stall. 
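
The tst/streq pairing can be modeled in a few lines. This is a sketch of the semantics only, with memory represented as a Python dictionary and all names chosen for illustration; it does not model pipeline behavior.

```python
TST_MASK = 0xC0000000  # the two most significant address bits


def sandboxed_store(memory: dict, base: int, disp: int, value: int) -> bool:
    """Model `tst base, #mask` followed by `streq value, [base, #disp]`."""
    z_flag = (base & TST_MASK) == 0   # tst sets Z when no masked bit is set
    if z_flag:                        # streq executes only when Z is set
        memory[(base + disp) & 0xFFFFFFFF] = value
    return z_flag


memory = {}
ok = sandboxed_store(memory, 0x00010000, 12, 42)       # base inside the low 1GB
blocked = sandboxed_store(memory, 0x80000000, 12, 99)  # base outside: store is skipped
print(ok, blocked, memory)  # True False {65548: 42}
```

Note that the base register is left unmodified in both cases, which is why the store does not have to wait on the result of the guard instruction.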


³The eq condition checks the Z flag, which tst will set if the selected
bits are clear.
We allow only base-plus-displacement addressing
with immediate displacement. Addressing modes that 
combine multiple registers to compute an effective ad- 
dress are forbidden for now. Within this limitation, we 
allow all types of stores, including the Store-Multiple 
instruction and DMA-style stores through coprocessors, 
provided the address is checked or masked. We allow the 
ARM architecture’s full range of pre- and post-increment 
and decrement modes. Note that since we mask only the 
base address and ARM immediate displacements can be 
up to ±4095 bytes, stores can access a small band of
memory outside the 1GB data region. We use guard
pages at each end of the data region to trap such accesses.⁴
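
The size of that band follows directly from the numbers above. The short calculation below is a worked example, assuming a checked base anywhere in the 1GB region and 32-bit address wraparound; the function name is illustrative.

```python
GB = 1 << 30
MAX_DISP = 4095  # largest immediate displacement magnitude, per the text above


def reachable_range(region_size: int = GB, max_disp: int = MAX_DISP):
    """Lowest and highest effective address for a checked base in [0, region_size)."""
    lowest = (0 - max_disp) % (1 << 32)      # wraps to the top of the 32-bit space
    highest = (region_size - 1) + max_disp
    return lowest, highest


low, high = reachable_range()
print(hex(low), hex(high))  # 0xfffff001 0x40000ffe
```

Both excursions are smaller than 4096 bytes, so under these assumptions a single 4KB guard page at each end is enough to catch them.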


3.1.2 Stores to the Stack 


To allow optimized stores through the stack pointer, we 
require that the stack pointer register (SP) always con- 
tain a valid data address. To enforce this requirement, 
we initialize SP with a valid address before activating 
the untrusted program, with further requirements for the 
two kinds of instructions that modify SP. Instructions that 
update SP as a side-effect of a memory reference (for ex- 
ample pop) are guaranteed to generate a fault if the mod- 
ified SP is invalid, because of our guard regions at either 
end of data space. Instructions that update SP directly 
are sandboxed with a subsequent masking instruction, as 
in: 


mov sp, r1
bic sp, sp, #0xc0000000


This approach could someday be extended to other reg- 
isters. For example, C-like languages might benefit from 
a frame pointer handled in much the same way as the SP, 
as we do for x86-64, while Java and C++ might addition- 
ally benefit from efficient stores through this. In these 
cases, we would also permit moves between any two 
such data-addressing registers without requiring mask- 
ing. 


3.1.3 Reference Compiler


We have modified LLVM 2.6 [13] to implement our 
ARM SFI design. We chose LLVM because it appeared 
to allow an easier implementation of our SFI design, and 
to explore its use in future cross-platform work. In prac- 
tice we have also found it to produce faster ARM code 
than GCC, although the details are outside the scope of 
this paper. The SFI changes were restricted to the ARM 
target implementation within the llc binary, and required
approximately 2100 lines of code and table modifications.
For the results presented in this paper we used


⁴The guard pages “below” the data region are actually at the top of
the address space, where the OS resides, and are not accessible from 
user mode. 
the compiler to generate standard Linux executables with
access to the full instruction set. This allows us to isolate 
the behavior of our SFI design from that of our trusted 
runtime. 


3.2 x86-64 


While the mechanisms of our x86-64 implementation 
are mostly analogous to those of our ARM implemen- 
tation, the details are very different. As with ARM, a 
valid data address range is surrounded by guard regions, 
and modifications to the stack pointer (rsp) and base 
pointer (rbp) are masked or guarded to ensure they al- 
ways contain a valid address. Our ARM approach relies 
on being able to ensure that the lowest 1GB of address 
space does not contain trusted code or data. Unfortu- 
nately this is not possible to ensure on some 64-bit Win- 
dows versions, which rules out simply using an address 
mask as ARM does. Instead, our x86-64 system takes 
advantage of more sophisticated addressing modes and 
uses a small set of “controlled” registers as the base for
most effective address computations. The system uses 
the very large address space, with a 4GB range for valid 
addresses surrounded by large (multiples of 4GB) un- 
mapped/protected regions. In this way many common 
x86 addressing modes can be used with little or no sand- 
boxing. 

Before we describe the details of our design, we pro- 
vide some relevant background on AMD’s 64-bit exten- 
sions to x86. Apart from the obvious 64-bit address 
space and register width, there are a number of perfor- 
mance relevant changes to the instruction set. The x86 
has an established practice of using related names to 
identify overlapping registers of different lengths; for ex- 
ample ax refers to the lower 16-bits of the 32-bit eax. In 
x86-64, general purpose registers are extended to 64-bits, 
with an r replacing the e to identify the 64 vs. 32-bit reg- 
isters, as in rax. x86-64 also introduces eight new gen- 
eral purpose registers, as a performance enhancement, 
named r8 - r15. To allow legacy instructions to use 
these additional registers, x86-64 defines a set of new 
prefix bytes to use for register selection. A relatively 
small number of legacy instructions were dropped from 
the x86-64 revision, but they tend to be rarely used in 
practice. 

With these details in mind, the following code genera- 
tion rules are specific to our x86-64 sandbox: 

• The module address space is an aligned 4GB region,
  flanked above and below by protected/unmapped regions
  of 10×4GB, to compensate for scaling (cf.
  below).
• A designated register “RZP” (currently r15) is initialized
  to the 4GB-aligned base address of untrusted
  memory and is read-only from untrusted
  code.
• All rip update instructions must use RZP.


To ensure that rsp and rbp contain a valid data address 
we use a few additional constraints: 
• rbp can be modified via a copy from rsp with no
  masking required.
• rsp can be modified via a copy from rbp with no
  masking required.
• Other modifications to rsp and rbp must be done
  with a pseudo-instruction that post-masks the address,
  ensuring that it contains a valid data address.


For example, a valid rsp update sequence looks like 
this: 


%esp = %eax
lea (%RZP, %rsp, 1), %rsp


In this sequence the assignment⁵ to esp guarantees that
the top 32-bits of rsp are cleared, and the subsequent 
add sets those bits to the valid base. Of course such se- 
quences must always be executed in their entirety. Given 
these rules, many common store instructions can be used 
with little or no sandboxing required. Push, pop and 
near call do not require checking because the up- 
dated value of rsp is checked by the subsequent mem- 
ory reference. The safety of a store that uses rsp or rbp 
with a simple 32-bit displacement: 


mov %eax, disp32(%rsp)


follows from the validity invariant on rsp and the guard 
ranges that absorb the displacement, with no masking re- 
quired. The most general addressing expression for an 
allowed store combines a valid base register (rsp, rbp 
or RZP) with a 32-bit displacement, a 32-bit index, and 
a scaling factor of 1, 2, 4, or 8. The effective address is 
computed as: 


basereg + indexreg * scale + disp32 
For example, in this pseudo-instruction: 


add $0x00abcdef, %ecx
mov %eax, disp32(%RZP, %rcx, scale)


the upper 32 bits of rcx are cleared by the arithmetic 
operation on ecx. Note that any operation that writes ecx
will clear the top 32 bits of rcx. This required masking
operation can often be combined with other useful operations.
Note that this general form allows generation of
addresses in a range of approximately 100GB, with the


⁵We have used the = operation to indicate assignment to the register
on the left hand side. There are several instructions, such as lea or
movzx, that can be used to perform this assignment. Other instructions
are written using AT&T syntax.
Figure 1: SPEC2000 SFI Performance Overhead for the ARM
Cortex-A9.


valid 4GB range near the middle. By reserving and un- 
mapping addresses outside the 4GB range we can ensure 
that any dereference of an address outside the valid range 
will lead to a fault. Clearly this scheme relies heavily on 
the very large 64-bit address space. 
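
As a worked check of why guard regions of 10×4GB suffice, the sketch below computes the extreme effective addresses of the general form, measured relative to the start of the 4GB region. It assumes the base register holds an offset within the region, the index is a zero-extended 32-bit value, the scale is at most 8, and disp32 is sign-extended; the names are illustrative.

```python
GB = 1 << 30
REGION = 4 * GB        # valid untrusted address range
GUARD = 10 * REGION    # unmapped region on each side, per the rules above


def reach(base_off: int, index: int, scale: int, disp: int) -> int:
    """Effective address relative to the start of the 4GB region."""
    return base_off + index * scale + disp


# Worst cases: largest base offset, index, scale and displacement, and the reverse.
highest = reach(REGION - 1, (1 << 32) - 1, 8, (1 << 31) - 1)
lowest = reach(0, 0, 1, -(1 << 31))

print(highest < REGION + GUARD, lowest >= -GUARD)  # True True
```

Under these assumptions the highest reachable address is just under 38GB past the region base and the lowest is 2GB below it, so every out-of-range access lands in an unmapped guard region.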

Finally, note that updates to the instruction pointer 
must align the address to 0 mod 32 and initialize the 
top 32-bits of address from RZP as in this example us- 
ing rdx: 


%edx = ...
and $0xffffffe0, %edx
lea (%RZP, %rdx, 1), %rdx
jmp *%rdx
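
The arithmetic performed by this sequence can be traced in Python. This is a sketch of the register values only; the base address of 44GB is taken from Table 1 and used here purely as an example value for RZP, and the function name is ours.

```python
MASK32 = 0xFFFFFFFF
GB = 1 << 30
RZP = 44 * GB  # example 4GB-aligned sandbox base (assumption, after Table 1)


def sandboxed_jump_target(rdx: int, rzp: int = RZP) -> int:
    """Model the masking sequence that precedes `jmp *%rdx`."""
    edx = rdx & MASK32       # a write to edx zero-extends into rdx
    edx &= 0xFFFFFFE0        # and $0xffffffe0, %edx: align to 0 mod 32
    return rzp + edx         # lea (%RZP, %rdx, 1), %rdx: supply the top 32 bits


target = sandboxed_jump_target(0xDEADBEEF12345678)
print(hex(target))  # 0xb12345660
```

Any stale upper bits in the source register are discarded, so the target always falls on a bundle boundary inside the 4GB untrusted region.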


Our x86-64 SFI implementation is based on GCC 
4.4.3, requiring a patch of about 2000 lines to the com- 
piler, linker and assembler source. At a high level, 
the changes include supporting the new call/return se- 
quences, making pointers and longs 32 bits, allocating 
r15 for use as RZP, and constraining address generation
to meet the above rules. 


4 Evaluation 


In this section we evaluate the performance of our ARM 
and x86-64 SFI schemes by comparing against the relevant
non-SFI baselines, using C benchmarks from
SPEC2000 INT CPU [12]. Our main analysis is based on 
out-of-order CPUs, with additional measurements for in- 
order systems at the end of this section. The out-of-order 
systems we used for our experiments were: 

• For x86-64, a 2.4GHz Intel Core 2 Quad with 8GB
  of RAM, running Ubuntu Linux 8.04, and
• For ARM, a 1GHz Cortex-A9 (Nvidia Tegra T20)
  with 512MB of RAM, running Ubuntu Linux 9.10.


4.1 ARM 


For ARM, we compared LLVM 2.6 [13] to the same 
compiler modified to support our SFI scheme. Figure 1 
summarizes the ARM results, with tabular data in Table 2.
Average overhead is about 5% on the out-of-order
             x86-64 SFI    x86-64 SFI   x86-64 SFI   ARM
             vs. oracle    vs. -m32     vs. -m64     SFI
geomean      14.7          5.24         6.9          5.17

Table 2: SPEC2000 SFI Performance Overhead (percent). The
first column compares x86-64 SFI overhead to the “oracle”
baseline compiler.





               ARM    ARM SFI   %inc.   %pad
164.gzip        73       90       24      13
175.vpr        225
176.gcc       1586
181.mcf         84
186.crafty     320
197.parser     219
253.perlbmk    812
254.gap        531
255.vortex     720
256.bzip2       74
300.twolf      289

Table 3: ARM SPEC2000 text segment size in kilobytes, with
% increase and % padding instructions.


Cortex-A9, and is fairly consistent across the bench- 
marks. Increases in binary size (Table 3) are compara- 
ble at around 20% (generally about 10% due to align- 
ment padding and 10% due to added instructions, shown 
in the rightmost columns of the table). We believe the 
observed overhead comes primarily from the increase in 
code path length. The mcf benchmark is known to
be data-cache intensive [17], a case in which the additional
sandboxing instructions have minimal impact, and
can sometimes be hidden by out-of-order execution on 
the Cortex-A9. We see the largest slowdowns for gap, 
gzip, and perlbmk. We suspect these overheads are 
a combination of increased path length and instruction 
cache penalties, although we do not have access to ARM 
hardware performance counter data to confirm this hy- 
pothesis. 
Figure 2: SPEC2000 SFI Performance Overhead for x86-64.
SFI performance is compared to the faster of -m32 and -m64
compilation.


4.2 x86-64 


Our x86-64 comparisons are based on GCC 4.4.3. The 
selection of a performance baseline is not straightforward.
The available compilation modes for x86 are either
32-bit (ILP32, -m32) or 64-bit (LP64, -m64). Each
represents a performance tradeoff, as demonstrated pre- 
viously [15, 25]. In particular, the 32-bit compilation 
model’s use of ILP32 base types means a smaller data 
working set compared to standard 64-bit compilation in 
GCC. On the other hand, use of the 64-bit instruction set 
offers additional registers and a more efficient register- 
based calling convention compared to standard 32-bit 
compilation. Ideally we would compare our SFI com- 
piler to a version of GCC that uses ILP32 and the 64-bit 
instruction set, but without our SFI implementation. In 
the absence of such a compiler, we consider a hypothet- 
ical compiler that uses an oracle to automatically select 
the faster of -m32 and -m64 compilation. Unless otherwise
noted all GCC compiles used the -O2 optimization
level.
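
One way to express the oracle baseline is to take, per benchmark, the smaller of the -m32 and -m64 execution times and measure SFI against that. The sketch below illustrates this under the assumption that overheads are aggregated with a geometric mean of time ratios; the timings are hypothetical values chosen for the example, not measurements from this paper.

```python
from math import prod


def overhead_pct(sfi: float, baseline: float) -> float:
    """Percent slowdown of an SFI run relative to a baseline run."""
    return (sfi / baseline - 1.0) * 100.0


def oracle_geomean_overhead(rows) -> float:
    """rows: iterable of (t_m32, t_m64, t_sfi) execution times in seconds."""
    ratios = [sfi / min(m32, m64) for m32, m64, sfi in rows]
    return (prod(ratios) ** (1.0 / len(ratios)) - 1.0) * 100.0


# Hypothetical timings, for illustration only.
rows = [(100.0, 110.0, 105.0), (200.0, 180.0, 198.0)]
print(round(oracle_geomean_overhead(rows), 2))
```

Because the oracle picks the faster baseline for each benchmark individually, the overhead against it is never smaller than the overhead against either fixed compilation mode.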

Figure 2 and Table 2 provide x86-64 results, where
average SFI overhead is about 5% compared to -m32,
7% compared to -m64 and 15% compared to the oracle
compiler. Across the benchmarks, the distribution
is roughly bi-modal. For parser and gap, SFI performance
is better than either -m32 or -m64 binaries
(Table 4). These are also cases where -m64 execution
is slower than -m32, indicative of data-cache pressure,
leading us to believe that the beneficial impact of additional
registers dominates SFI overhead. Three other benchmarks
(vpr, mcf and twolf) show SFI impact of less
than 2%. We believe these are memory-bound and do not
benefit significantly from the additional registers.

At the other end of the range, four benchmarks,
gcc, crafty, perlbmk and vortex, show performance
overhead greater than 25%. All run as fast or
faster for -m64 than -m32, suggesting that data-cache
pressure does not dominate their performance. Gcc,
perlbmk and vortex have large text, and we suspect
SFI code-size increase may be contributing to instruction
cache pressure. From hardware performance
counter data, crafty shows a 26% increase in instructions
retired and an increase in branch mispredicts from
2% to 8%, likely contributors to the observed SFI performance
overhead. We have also observed that perlbmk
and vortex are very sensitive to memcpy performance.
Our x86-64 experiments use a relatively simple implementation
of memcpy, to allow the same code to be
used with and without the SFI sandbox. In our continuing
work we are adapting a tuned memcpy implementation
to work within our sandbox.

Table 4: SPEC2000 x86-64 execution times, in seconds.

               -m32    -m64    SFI
164.gzip         82      85
175.vpr         239
176.gcc        1868
181.mcf          20
186.crafty      286
197.parser      243
253.perlbmk     746
254.gap         955
255.vortex      643
256.bzip2        98
300.twolf

Table 5: SPEC2000 x86 text sizes, in kilobytes.


4.3 In-Order vs. Out-of-Order CPUs


We suspected that the overhead of our SFI scheme would 
be hidden in part by CPU microarchitectures that bet- 
ter exploit instruction-level parallelism. In particular, 
we suspected we would be helped by the ability of out- 
of-order CPUs to schedule around any bottlenecks that 
SFI introduces. Fortunately, both architectures we tested 
have multiple implementations, including recent prod- 
ucts with in-order dispatch. To test our hypothesis, we 
ran a subset of our benchmarks on in-order machines:


• A 1.6GHz Intel Atom 330 with 2GB of RAM, running
Ubuntu Linux 9.10.


• A 500MHz Cortex-A8 (Texas Instruments
OMAP3540) with 256MB of RAM, running
Angstrom Linux.


Figure 3: Additional SPEC2000 SFI overhead on in-order
microarchitectures.

Table 6: Comparison of SPEC2000 overhead (percent) for
in-order vs. out-of-order microarchitecture.


The results are shown in Figure 3 and Table 6. For 
our x86-64 SFI scheme, the incremental overhead can be 
significantly higher on the Atom 330 compared to a Core 
2 Duo. This suggests out-of-order execution can help 
hide the overhead of SFI, although other factors may also 
contribute, including much smaller caches on the Atom 
part and the fact that GCC’s 64-bit code generation may 
be biased towards the Core 2 microarchitecture. These 
results should be considered preliminary, as there are a 
number of optimizations for Atom that are not yet avail- 
able in our compiler, including Atom-specific instruction 
scheduling and better selection of no-ops. Generation of 
efficient SFI code for in-order x86-64 systems is an area 
of continuing work. 

The story on ARM is different. While some bench- 
marks (notably gap) have higher overhead, some (such 
as parser) have equally reduced overhead. We were 
surprised by this result, and suggest two factors to ac- 
count for it. First, microarchitectural evaluation of the 
Cortex-A8 [3] suggests that the instruction sequences 
produced by our SFI can be issued without encountering
a hazard that would cause a pipeline stall. Second, we
suggest that the Cortex-A9, as the first widely-available 
out-of-order ARM chip, might not match the maturity 
and sophistication of the Core 2 Quad. 


5 Discussion 


Given our initial goal to impact execution time by less 
than 10%, we believe these SFI designs are promising. 
At this level of performance, most developers targeting 
our system would do better to tune their own code rather 
than worry about SFI overhead. At the same time, the 
geometric mean commonly used to report SPEC results 
does a poor job of capturing the system’s performance 
characteristics; nobody should expect to get “average” 
performance. As such we will continue our efforts to 
reduce the impact of SFI for the cases with the largest 
slowdowns. 
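
To illustrate why a geometric mean can hide large per-benchmark slowdowns, the following sketch computes both the geometric mean and the worst case for a set of slowdown ratios. The ratios are invented for this illustration and are not measurements from this paper.

```python
import math

def geometric_mean(ratios):
    """Geometric mean of positive slowdown ratios (sandboxed time / baseline time)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical slowdown ratios, NOT data from this paper:
# most workloads are barely affected, one is hit hard.
ratios = [1.00, 1.02, 0.97, 1.03, 1.30]

mean_overhead = (geometric_mean(ratios) - 1.0) * 100
worst_overhead = (max(ratios) - 1.0) * 100
print(f"geometric mean overhead: {mean_overhead:.1f}%")
print(f"worst-case overhead:     {worst_overhead:.1f}%")
```

With such a distribution the mean stays under the 10% goal while one workload is far above it, which is the situation the paragraph above cautions about.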

Our work fulfills a prediction that the costs of SFI 
would become lower over time [28]. While thoughtful 
design has certainly helped minimize SFI performance 
impact, our experiments also suggest how SFI has bene- 
fited from trends in microarchitecture. Out-of-order ex- 
ecution, multi-issue architectures, and the effective gap 
between memory speed and CPU speed all contribute 
to reduce the impact of the register-register instructions 
used by our sandboxing schemes. 

We were surprised by the low overhead of the ARM 
sandbox, and that the x86-64 sandbox overhead should 
be so much larger by comparison. Clever ARM in- 
struction encodings definitely contributed. Our design 
directly benefits from the ARM’s powerful bit-clear in- 
struction and from predication on stores. It usually re- 
quires one instruction per sandboxed ARM operation, 
whereas the x86-64 sandbox frequently requires extra in- 
structions for address calculations and adds a prefix byte 
to many instructions. The regularity of the ARM instruction
set and smaller bundles (16 vs. 32 bytes) also mean
that less padding is required for the ARM, hence less
instruction cache pressure. The x86-64 design also
induces branch misprediction through our omission of the
ret instruction. By comparison, the ARM design uses
the normal return idiom and hence has minimal impact on
branch prediction. We also note that the x86-64 systems are
generally clocked at a much higher rate than the ARM
systems, making the relative distance to memory a possible
factor. Unfortunately we do not have data to explore this 
question thoroughly at this time. 
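
The register-register sandboxing operations discussed above can be modeled with plain bit arithmetic. The sketch below is an illustration only: the bundle sizes (16 bytes for ARM, 32 bytes for x86-64) come from the text, while the data-region mask, the 32-bit address model, and the helper names are assumptions made for the example and are not necessarily the constants or sequences used by the implementation.

```python
ARM_BUNDLE = 16      # bundle size in bytes, from the text
X86_64_BUNDLE = 32   # bundle size in bytes, from the text

# Assumed for illustration: a 1 GiB sandbox occupying the low addresses,
# so the two high bits of a 32-bit address must be cleared.
HYPOTHETICAL_DATA_MASK = 0xC0000000

def bit_clear(value, mask):
    """Model of a bit-clear operation: value AND NOT mask, on 32 bits."""
    return value & ~mask & 0xFFFFFFFF

def sandbox_store_address(addr):
    """Force a store address into the assumed sandbox region."""
    return bit_clear(addr, HYPOTHETICAL_DATA_MASK)

def sandbox_branch_target(addr, bundle):
    """Force an indirect branch target into the region and onto a bundle boundary."""
    return bit_clear(addr, HYPOTHETICAL_DATA_MASK | (bundle - 1))

print(hex(sandbox_store_address(0xDEADBEEF)))
print(hex(sandbox_branch_target(0xDEADBEEF, ARM_BUNDLE)))
print(hex(sandbox_branch_target(0xDEADBEEF, X86_64_BUNDLE)))
```

A single masking step per store or indirect branch is what makes a one-instruction sandboxing sequence plausible on an architecture with a bit-clear instruction; larger bundles mask more low bits and so require more padding to keep branch targets aligned.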

We were initially troubled by the result that our system
improves performance for so many benchmarks compared
to the common -m32 compilation mode. This
clearly results from the ability of our system to leverage
features of the 64-bit instruction set. There is a sense in
which the comparison is unfair, as running a 32-bit binary
on a 64-bit machine leaves a lot of resources idle.
Our results demonstrate in part the benefit of exploiting
those additional resources.

We were also surprised by the magnitude of the posi- 
tive impact of ILP32 primitive types for a 64-bit binary. 
For now our x86-64 design benefits from this as yet un- 
exploited opportunity, although based on our experience 
the community might do well to consider making ILP32
a standard option for x86-64 execution. 

In our continuing work we are pursuing opportuni- 
ties to reduce SFI overhead of our x86-64 system, which 
we do not consider satisfactory. Our current alignment 
implementation is conservative, and we have identified 
a number of opportunities to reduce related padding. 
We will be moving to GCC version 4.5 which has 
instruction-scheduling improvements for in-order Atom 
systems. In the fullness of time we look forward to devel- 
oping an infrastructure for profile-guided optimization, 
which should provide opportunities for both instruction 
cache and branch optimizations. 


6 Related Work 


Our work draws directly on Native Client, a previous 
system for sandboxing 32-bit x86 modules [30]. Our 
scheme for optimizing stack references was informed 
by an earlier system described by McCamant and Mor- 
risett [18]. We were heavily influenced by the original 
software fault isolation work by Wahbe, Lucco, Ander- 
son and Graham [28]. 

Although there is a large body of published research 
on software fault isolation, we are aware of no publica- 
tions that specifically explore SFI for ARM or for the 
64-bit extensions of the x86 instruction set. SFI for 
SPARC may be the most thoroughly studied, being the 
subject of the original SFI paper by Wahbe et al. [28] 
and numerous subsequent studies by collaborators of 
Wahbe and Lucco [2, 16, 11] and independent investi- 
gators [4, 5, 8, 9, 10, 14, 22, 29]. As this work matured, 
much of the community’s attention turned to a more vir- 
tual machine-oriented approach to isolation, incorporat- 
ing a trusted compiler or interpreter into the trusted core 
of the system. 

The ubiquity of the 32-bit x86 instruction set has cat- 
alyzed development of a number of additional sandbox- 
ing schemes. MiSFIT [23] contemplated use of software 
fault isolation to constrain untrusted kernel modules [24]. 
Unlike our system, they relied on a trusted compiler 
rather than a validator. SystemTAP and XFI [21, 7] fur- 
ther contemplate x86 sandboxing schemes for kernel ex- 
tension modules. McCamant and Morrisett [18, 19] stud- 
ied x86 SFI towards the goals of system security and re- 
ducing the performance impact of SFI. 

Compared to our sandboxing schemes, CFI [1] pro- 
vides finer-grained control flow integrity. Whereas our 
systems only guarantee indirect control flow will target
an aligned address in the text segment, CFI can restrict
a specific control transfer to a fairly arbitrary subset of 
known targets. While this more precise control is useful 
in some scenarios, such as ensuring integrity of transla- 
tions from higher-level languages, our use of alignment 
constraints helps simplify our design and implementa- 
tion. CFI also has somewhat higher average overhead 
(15% on SPEC2000), not surprising since its instrumen- 
tation sequences are longer than ours. XFI [7] adds 
to CFI further integrity constraints such as on memory 
and the stack, with additional overhead. More recently, 
BGI [6] considers an innovative scheme for constrain- 
ing the memory activity of device drivers, using a large 
bitmap to track memory accessibility at very fine gran- 
ularity. None of these projects considered the problem 
of operating system portability, a key requirement of our 
systems. 

The Nooks system [26] enhances operating system 
kernel reliability by isolating trusted kernel code from 
untrusted device driver modules using a transparent OS 
layer called the Nooks Isolation Manager (NIM). Like 
Native Client, NIM uses memory protection to isolate 
untrusted modules. As the NIM operates in the kernel, 
x86 segments are not available. The NIM instead uses a 
private page table for each extension module. To change 
protection domains, the NIM updates the x86 page ta- 
ble base address, an operation that has the side effect 
of flushing the x86 Translation Lookaside Buffer (TLB). 
In this way, NIM’s use of page tables suggests an alter- 
native to segment protection as used by NaCl-x86-32. 
While a performance analysis of these two approaches 
would likely expose interesting differences, the compar- 
ison is moot on the x86 as one mechanism is available 
only within the kernel and the other only outside the ker- 
nel. A critical distinction between Nooks and our sand- 
boxing schemes is that Nooks is designed only to pro- 
tect against unintentional bugs, not abuse. In contrast, 
our sandboxing schemes must be resistant to attempted 
deliberate abuse, mandating our mechanisms for reliable 
x86 disassembly and control flow integrity. These mechanisms
have no analog in Nooks.

Our system uses a static validator rather than a trusted 
compiler, similar to validators described for other sys- 
tems [7, 18, 19, 21], applying the concept of proof- 
carrying code [20]. This has the benefit of greatly re- 
ducing the size of the trusted computing base [27], and 
obviates the need for cryptographic signatures from the 
compiler. Apart from simplifying the security implemen- 
tation, this has the further benefit of opening our system 
to 3rd-party tool chains. 


7 Conclusion 


This paper has presented practical software fault isolation
systems for ARM and for 64-bit x86. We believe
these systems demonstrate that the performance overhead
of SFI on modern CPU implementations is small
enough to make it a practical option for general purpose 
use when executing untrusted native code. Our experi- 
ence indicates that SFI benefits from trends in microar- 
chitecture, such as out-of-order and multi-issue CPU 
cores, although further optimization may be required to 
avoid penalties on some recent low power in-order cores. 
We further found that for data-bound workloads, mem- 
ory latency can hide the impact of SFI. 

Source code for Google Native Client can be found at:


http://code.google.com/p/nativeclient/


Making Linux Protection Mechanisms Egalitarian with UserFS


ABSTRACT


UserFS provides egalitarian OS protection mechanisms
in Linux. UserFS allows any user—not just the system 
administrator—to allocate Unix user IDs, to use chroot, 
and to set up firewall rules in order to confine untrusted 
code. One key idea in UserFS is representing user IDs as 
files in a /proc-like file system, thus allowing applica- 
tions to manage user IDs like any other files, by setting 
permissions and passing file descriptors over Unix do- 
main sockets. UserFS addresses several challenges in 
making user IDs egalitarian, including accountability, re- 
source allocation, persistence, and UID reuse. We have 
ported several applications to take advantage of UserFS; 
by changing just tens to hundreds of lines of code, we 
prevented attackers from exploiting application-level vul- 
nerabilities, such as code injection or missing ACL checks 
in a PHP-based wiki application. Implementing UserFS 
requires minimal changes to the Linux kernel—a single 
3,000-line kernel module—and incurs no performance 
overhead for most operations, making it practical to de- 
ploy on real systems. 


1 INTRODUCTION 


OS protection mechanisms are key to mediating access 
to OS-managed resources, such as the file system, the 
network, or other physical devices. For example, system 
administrators can use Unix user IDs to ensure that dif- 
ferent users cannot corrupt each other’s files; they can 
set up a chroot jail to prevent a web server from access- 
ing unrelated files; or they can create firewall rules to 
control network access to their machine. Most operating 
systems provide a range of such mechanisms that help 
administrators enforce their security policies. 

While these protection mechanisms can enforce the 
administrator’s policy, many applications have their own 
security policies for OS-managed resources. For instance, 
an email client may want to execute suspicious attach- 
ments in isolation, without access to the user’s files; a 
networked game may want to configure a firewall to make 
sure it does not receive unwanted network traffic that 
may exploit a vulnerability; and a web browser may want 
to precisely control what files and devices (such as a 
video camera) different sites or plugins can access. Un- 
fortunately, typical OS protection mechanisms are only 
accessible to the administrator: an ordinary Unix user 
cannot allocate a new user ID, use chroot, or change
firewall rules, forcing applications to invent their own
protection techniques like system call interposition [15], 
binary rewriting [30] or analysis [13, 45], or interposing 
on system accesses in a language runtime like Javascript. 


This paper presents the design of UserFS, a kernel 
framework that allows any application to use traditional 
OS protection mechanisms on a Unix system, and a proto- 
type implementation of UserFS for Linux. UserFS makes 
protection mechanisms egalitarian, so that any user—not 
just the system administrator—can allocate new user IDs, 
set up firewall rules, and isolate processes using chroot. 
By using the operating system’s own protection mecha- 
nisms, applications can avoid race conditions and ambi- 
guities associated with system call interposition [14, 43], 
can confine existing code without having to recompile or 
rewrite it in a new language, and can enforce a coherent 
security policy for large applications that might span sev- 
eral runtime environments, such as both Javascript and 
Native Client [45], or Java and JNI code. 


Allowing arbitrary users to manipulate OS protection 
mechanisms through UserFS requires addressing several 
challenges. First, UserFS must ensure that a malicious 
user cannot exploit these mechanisms to violate another 
application’s security policy, perhaps by re-using a pre- 
viously allocated user ID, or by running setuid-root pro- 
grams in a malicious chroot environment. Second, user 
IDs are often used in Unix for accountability and auditing, 
and UserFS must ensure that a system administrator can 
attribute actions to users that he or she knows about, even 
for processes that are running with a newly-allocated user 
ID. Finally, UserFS should be compatible with existing
applications, interfaces, and kernel components whenever 
possible, to make it easy to incrementally deploy UserFS 
in practical systems. 


UserFS addresses these challenges with a few key ideas. 
First, UserFS allows applications to allocate user IDs 
that are indistinguishable from traditional user IDs man- 
aged by the system administrator. This ensures that ex- 
isting applications do not need to be modified to support 
application-allocated protection domains, and that exist- 
ing UID-based protection mechanisms like file permis- 
sions can be reused. Second, UserFS maintains a shadow 
generation number associated with each user ID, to make 
sure that setuid executables for a given UID cannot be 
used to obtain privileges once the UID has been reused by 
a new application. Third, UserFS represents allocated user
IDs using files in a special file system. This makes it easy
to manipulate user IDs, much like using the /proc file
system on Linux, and applications can use file descriptor 
passing to delegate privileges and implement authentica- 
tion logic. Finally, UserFS uses information about what 
user ID allocated what other user IDs to determine what 
setuid executables can be trusted in any given chroot 
environment, as will be described later. 

We have implemented a prototype of UserFS for Linux 
purely as a kernel module, consisting of less than 3,000 
lines of code, along with user-level support libraries for C 
and PHP-based applications. UserFS imposes no per- 
formance overhead for most existing operations, and 
only performs an additional check when running set- 
uid executables. We modified several applications to en- 
force security policies using UserFS, including Google’s 
Chromium web browser, a PHP-based wiki application, 
an FTP server, ssh-agent, and Unix commands like 
bash and su, all with minimal code modifications, sug- 
gesting that UserFS is easy to use. We further show that 
our modified wiki is not vulnerable by design to 5 out of 
6 security vulnerabilities found in that application over 
the past several years. 

The key contribution of this work is the first system 
that allows Linux protection and isolation mechanisms to 
be freely used by non-root code. This improves overall 
security both by allowing applications to enforce their 
policies in the OS, and by reducing the amount of code 
that needs to run as root in the first place (for example to 
set up chroot jails, create new user accounts, or config- 
ure firewall rules). 

The rest of this paper is structured as follows. Sec- 
tion 2 provides more concrete examples of applications 
that would benefit from access to OS protection mecha- 
nisms. Section 3 describes the design of UserFS in more 
detail, and Section 4 covers our prototype implementation. 
We illustrate how we modified existing applications to 
take advantage of UserFS in Section 5, and Section 6 eval- 
uates the security and performance of UserFS. Section 7 
surveys related work, Section 8 discusses the limitations 
of our system, and Section 9 concludes. 


2 MOTIVATION AND GOALS 


The main goal of UserFS is to help applications reduce 
the amount of trusted code, by allowing them to use tradi- 
tionally privileged OS protection mechanisms to control 
access to system resources, such as the file system and the 
network. We believe this will allow many applications to 
improve their security, by preventing compromises where 
an attacker takes advantage of an application’s excessive 
OS-level privileges. However, UserFS is not a security 
panacea, and programmers will still need to think about 
a wide range of other security issues from cryptography 
to cross-site scripting attacks. The rest of this section
provides several motivating examples in which UserFS
can improve security. 


Avoiding root privileges in existing applications. 
Typical Unix systems run a large amount of code as root 
in order to perform privileged operations. For example, 
network services that allow user login, such as an FTP 
server, sshd, or an IMAP server often run as root in or- 
der to authenticate users and invoke setuid() to acquire 
their privileges on login. Unfortunately, these same net- 
work services are the parts of the system most exposed 
to attack from external adversaries, making any bug in 
their code a potential security vulnerability. While some 
attempts have been made to privilege-separate network 
services, such as with OpenSSH [39], it requires carefully 
re-designing the application and explicitly moving state 
between privileged and unprivileged components. By al- 
lowing processes to explicitly manipulate Unix users as 
file descriptors, and pass them between processes, UserFS 
eliminates the need to run network services as the root 
user, as we will show in Section 5.3.

In addition to network services, users themselves often 
want to run code as root, in order to perform currently- 
privileged operations. For instance, chroot can be useful 
in building a complex software package that has many 
dependencies, but unfortunately chroot can only be in- 
voked by root. By allowing users to use a range of mech- 
anisms currently reserved for the system administrator, 
UserFS further reduces the need to run code as root. 


Sandboxing untrusted code. Users often interact with 
untrusted or partially-trusted code or data on their com- 
puters. For example, users may receive attachments via 
email, or download untrusted files from the web. Opening 
or executing these files may exploit vulnerabilities in the 
user’s system. While it’s possible for the mail client or 
web browser to handle a few types of attachments (such 
as HTML files) safely, in the general case opening the 
document will require running a wide range of existing 
applications (e.g. OpenOffice for Word files, or Adobe 
Acrobat to view PDFs). These helper applications, even 
if they are not malicious themselves, might perform unde- 
sirable actions when viewing malicious documents, such 
as a Word macro virus or a PDF file that exploits a buffer 
overflow in Acrobat. 

Guarding against these problems requires isolating the 
suspect application from the rest of the system, while 
providing a limited degree of sharing (such as initializing 
Acrobat with the user’s preferences). With UserFS, the 
mail client or web browser can allocate a fresh user ID 
to view a suspicious file, and use firewall rules to ensure 
the application does not abuse the user’s network connection
(e.g. to send spam). Section 5.2 describes
how UserFS helps Unix users isolate partially-trusted or
untrusted applications in this manner.


Enforcing separation in privilege-separated applications.
One approach to building high-security applications
is to follow the principle of least privilege [40] by
breaking up an application into several components, each
of which has the minimum privileges necessary. For
instance, OpenSSH [39], qmail [3], and the Chromium
browser [2] follow this model, and tools exist to help
programmers privilege-separate existing applications [7].
One problem is that executing components with fewer
privileges requires either root privilege to start with (and
applications that are not fully trusted to start with are unlikely
to have root privileges), or other complex mechanisms.
With UserFS, privilege-separated applications can use
existing OS protection primitives to enforce isolation
between their components, without requiring root privileges
to do so. We hope that, by making it easier to execute
code with fewer privileges, UserFS encourages more
applications to improve their security by reducing privileges
and running as multiple components. As an example,
Section 5.4 shows how UserFS can isolate different processes
in the Chromium web browser.


Exporting OS resources in higher-level runtimes. Finally,
there are many higher-level runtimes running on a
typical desktop system, such as Javascript, Flash, Native 
Client [45], and Java. Applications running on top of 
these runtimes often want to access underlying OS re- 
sources, including the file system, the network, and local 
devices such as a video camera. This currently forces the 
runtimes to implement their own protection schemes, e.g. 
based on file names, which can be fragile, and worse yet, 
enforce different policies depending on what runtime an 
application happens to use. By using UserFS, runtimes 
can delegate enforcement of security checks to the OS 
kernel, by allocating a fresh user ID for logical protection 
domains managed by the runtime. For example, Section 5.1
shows how UserFS can enforce security policies
for a PHP web application. In the future, we hope the 
same mechanisms can be used to implement a coherent 
security policy for one application across all runtimes that 
it might use. 


3 KERNEL INTERFACE DESIGN 


To help applications reduce the amount of trusted code, 
UserFS allows any application to allocate new principals; 
in Unix, principals are user IDs and group IDs. An ap- 
plication can then enforce its desired security policy by 
first allocating new principals for its different components, 
then, second, setting file permissions—i.e., read, write, 
and execute privileges for principals—to match its secu- 
rity policy, and finally, running its different components 
under the newly-allocated principals. 

A slight complication arises from the fact that, in many
Unix systems, there is a wide range of resources available
to all applications by default, such as the /tmp directory
or the network stack. Thus, to restrict untrusted code
from accessing resources that are accessible by default, 
UserFS also allows applications to impose restrictions on 
a process, in the form of chroot jails or firewall rules. 
The rest of this section describes the design of the UserFS 
kernel mechanisms that provide these features. 


3.1 User ID allocation 


The first function of UserFS is to allow any application 
to allocate a new principal, in the form of a Unix user 
ID. At a naively high level, allocating user IDs is easy: 
pick a previously unused user ID value and return it to the 
application. However, there are four technical challenges 
that must be addressed in practice: 


• When is it safe for a process to exercise the privileges
of another user ID, or to change to a different
UID? Traditional Unix provides two extremes, neither
of which is sufficient for our requirements:
non-root processes can only exercise the privileges
of their current UID, and root processes can exercise
everyone’s privileges.


• How do we keep track of the resources associated
with user IDs? Traditional Unix systems largely rely
on UIDs to attribute processes to users, to implement
auditing, and to perform resource accounting, but if
users are able to create new user IDs, they may be
able to evade UID-based accounting mechanisms.


• How do we recycle user ID values? Most Unix systems
and applications reserve 32 bits of space for
user ID values, and an adversary or a busy system
can quickly exhaust 2^32 user ID values. On the other
hand, if we recycle UIDs, we must make sure that
the previous owner of a particular UID cannot obtain
privileges over the new owner of the same UID
value.


• Finally, how do we keep user ID allocations persistent
across reboots of the kernel?


We will now describe how UserFS addresses these chal- 
lenges, in turn. 


3.1.1 Representing privileges 


UserFS represents user IDs with files that we will call 
Ufiles in a special /proc-like file system that, by conven- 
tion, is mounted as /userfs. Privileges with respect to a 
specific user ID can thus be represented by file descrip- 
tors pointing to the appropriate Ufile. Any process that 
has an open file descriptor corresponding to a Ufile can 
issue a USERFS_IOC_SETUID ioctl on that file descriptor 
to change the process’s current UID (more specifically, 
euid) to the Ufile’s UID.

Aside from the special ioctl calls, file descriptors for
Ufiles behave exactly like any other Unix file descriptor. 
For instance, an application can keep multiple file descrip- 
tors for different user IDs open at the same time, and 
switch its process UID back and forth between them. Ap- 
plications can also use file descriptor passing over Unix 
domain sockets to pass privileges between processes. This 
can be useful in implementing user authentication or lo- 
gin, by allowing an authentication daemon to accept login 
requests over a Unix domain socket, and to return a file 
descriptor for that user’s Ufile if the supplied credential 
(e.g. password) was correct. 
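
File descriptor passing itself is a standard Unix facility. The following sketch shows the mechanism using Python's socket.send_fds and socket.recv_fds (Python 3.9 or later, Unix only). An ordinary temporary file stands in for a Ufile, since the example does not assume that UserFS is installed; the credential check is omitted.

```python
import os
import socket
import tempfile

# A connected pair of Unix domain sockets: one end plays the
# authentication daemon, the other the client requesting a login.
daemon_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Stand-in for a Ufile: an ordinary temporary file. With UserFS the
# daemon would instead hold a descriptor for the user's Ufile.
stand_in = tempfile.TemporaryFile()
stand_in.write(b"stand-in for a Ufile")
stand_in.flush()

# Daemon side: after checking the credential (not shown), send the descriptor.
socket.send_fds(daemon_sock, [b"ok"], [stand_in.fileno()])

# Client side: receive the message together with the passed descriptor.
msg, fds, _flags, _addr = socket.recv_fds(client_sock, 1024, 1)
received_fd = fds[0]

# The received descriptor refers to the same open file as the sender's.
os.lseek(received_fd, 0, os.SEEK_SET)
content = os.read(received_fd, 1024)
print(msg, content)

os.close(received_fd)
stand_in.close()
daemon_sock.close()
client_sock.close()
```

The receiver never needs path-name access to the file: holding the descriptor is what conveys the capability, which is the property the authentication scenario above relies on.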

Finally, each Ufile under /userfs has an owner user 
and group associated with it, along with user and group 
permissions. These permissions control what other users 
and groups can obtain the privileges of a particular UID 
by opening its Ufile via path name. By default, a Ufile is owned
by the user and group IDs of the process that initially 
allocated that UID, and has Unix permissions 600 (i.e. 
accessible by owner, but not by group or others), allowing 
the process that allocated the UID to access it initially. 
A process can always access the Ufile for the process’s 
current UID, regardless of the permissions on that Ufile 
(this allows a process to always obtain a file descriptor for 
its current UID and pass it to others via FD passing). 
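
The access rule described above can be summarized as a small decision function. This is a toy model written for this explanation: the class and function names are hypothetical, it assumes that the read bit is what governs opening a Ufile, it ignores root, and the real checks are performed by the kernel.

```python
from dataclasses import dataclass

@dataclass
class Ufile:
    uid: int            # the UID this Ufile represents
    owner_uid: int      # by default, UID of the allocating process
    owner_gid: int      # by default, GID of the allocating process
    mode: int = 0o600   # default permissions given in the text

def may_open(proc_uid, proc_gids, ufile):
    """Toy model: may a process open this Ufile by path name?"""
    # A process can always access the Ufile for its own current UID.
    if proc_uid == ufile.uid:
        return True
    # Otherwise the Ufile's owner and group permissions decide
    # (assumption: the read bit is the relevant one).
    if proc_uid == ufile.owner_uid:
        return bool(ufile.mode & 0o400)
    if ufile.owner_gid in proc_gids:
        return bool(ufile.mode & 0o040)
    return False

u = Ufile(uid=5001, owner_uid=1001, owner_gid=1001)
print(may_open(1001, {1001}, u))  # the allocating user
print(may_open(5001, set(), u))   # a process already running as UID 5001
print(may_open(1002, {1002}, u))  # an unrelated user
```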


3.1.2 Accountability hierarchy 


Ufiles help represent privileges over a particular user 
ID, but to provide accountability, our system must also 
be able to say what user is responsible for a particular 
user ID. This is useful for accounting and auditing pur- 
poses: tracking what users are using disk space, running 
CPU-intensive processes, or allocating many user IDs via 
UserFS, or tracking down what user tried to exploit some 
vulnerability a week ago. 

To provide accountability, UserFS implements a hierarchy
of user IDs. In particular, each UID has a parent
UID associated with it. The parent UID of existing Unix
users is root (0), including the parent of root itself. For
dynamically-allocated user IDs, the parent is the user ID 
of the process that allocated that UID (which in turn has 
its own parent UID). UserFS represents this UID hier- 
archy with directories under /userfs, as illustrated in 
Figure 1. For convenience, UserFS also provides sym- 
bolic links for each UID under /userfs that point to the 
hierarchical name of that UID, which helps the system 
administrator figure out who is responsible for a particular 
UID. 
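
The parent relation and the hierarchical names it induces can be sketched as follows. The dictionary reproduces the example UIDs of Figure 1; the function names are hypothetical and the code is a user-space illustration, not part of UserFS.

```python
# Parent UID of each UID, following the example in Figure 1.
# Root (0) is its own parent, as stated in the text.
PARENT = {0: 0, 1001: 0, 1002: 0, 5001: 1001, 5002: 5001, 5003: 1001}

def hierarchical_name(uid, parent=PARENT):
    """Directory of a UID relative to /userfs (root maps to the empty string)."""
    parts = []
    while uid != 0:
        parts.append(str(uid))
        uid = parent[uid]
    return "/".join(reversed(parts))

def responsible_user(uid, parent=PARENT):
    """Walk up to the traditional (root-parented) account accountable for uid."""
    while uid != 0 and parent[uid] != 0:
        uid = parent[uid]
    return uid

print(hierarchical_name(5002))
print(responsible_user(5002))
```

The symbolic links under /userfs play the role of hierarchical_name here: given only a flat UID from a log, they lead back to the chain of UIDs that allocated it.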

In addition to the USERFS_IOC_SETUID ioctl that was 
mentioned earlier, UserFS supports three more opera- 
tions. First, a process can allocate new UIDs by issuing a 
USERFS_IOC_ALLOC ioctl on a Ufile. This allocates a new 
UID as a child of the Ufile’s UID, and the value of the 
newly allocated UID is returned as the result of the ioctl.
A process can also de-allocate UIDs by performing an
rmdir on the appropriate directory under /userfs. This 
will recursively de-allocate that UID and all of its child 
UIDs (i.e. it will work even on non-empty directories), 
and kill any processes running under those UIDs, for rea- 
sons we will describe shortly. Finally, a process can move 
a UID in the hierarchy using rename (for example, if 
one user is no longer interested in being responsible for 
a particular UID, but another user is willing to provide 
resources for it). 
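
The allocation, de-allocation, and move operations can be modeled on top of a parent map. The sketch below is a user-space toy, not the kernel module: the class and method names are hypothetical, UID values are chosen by a simple counter (an assumed policy), the cycle check in rename is an assumed sanity check, and killing processes is represented only by returning the set of removed UIDs.

```python
class UidTree:
    """Toy model of the UID hierarchy; not the UserFS implementation."""

    def __init__(self, first_dynamic_uid=5001):
        self.parent = {0: 0}               # root is its own parent
        self.next_uid = first_dynamic_uid  # assumed allocation policy

    def add_traditional(self, uid):
        """Register an existing Unix account, whose parent is root."""
        self.parent[uid] = 0

    def alloc(self, parent_uid):
        """Model of USERFS_IOC_ALLOC: allocate a new UID as a child of parent_uid."""
        if parent_uid not in self.parent:
            raise KeyError(parent_uid)
        uid = self.next_uid
        self.next_uid += 1
        self.parent[uid] = parent_uid
        return uid

    def children(self, uid):
        return [c for c, p in self.parent.items() if p == uid and c != uid]

    def rmdir(self, uid):
        """Model of rmdir: de-allocate uid and, recursively, all of its children."""
        if uid == 0:
            raise ValueError("root cannot be de-allocated in this model")
        removed, stack = set(), [uid]
        while stack:
            u = stack.pop()
            stack.extend(self.children(u))
            removed.add(u)
        for u in removed:
            del self.parent[u]
        return removed  # processes under these UIDs would be killed

    def rename(self, uid, new_parent):
        """Model of rename: move uid, with its subtree, under new_parent."""
        p = new_parent
        while p != 0:  # assumed check: do not move a UID below itself
            if p == uid:
                raise ValueError("cannot move a UID under its own descendant")
            p = self.parent[p]
        self.parent[uid] = new_parent

tree = UidTree()
tree.add_traditional(1001)
tree.add_traditional(1002)
a = tree.alloc(1001)      # child of 1001
b = tree.alloc(a)         # grandchild of 1001
c = tree.alloc(1001)      # second child of 1001
tree.rename(c, 1002)      # user 1002 takes over responsibility for c
removed = tree.rmdir(a)   # removes a and its child b
print(sorted(removed), tree.parent)
```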

Finally, accountability information may be important 
long after the UID in question has been de-allocated (e.g. 
the administrator wants to know who was responsible for 
a break-in attempt, but the UID in the log associated with 
the attempt has been de-allocated already). To address 
this problem, UserFS uses syslog to log all allocations, so 
that an administrator can reconstruct who was responsible 
for that UID at any point in time. 


3.1.3 UID reuse 


An ideal system would provide a unique identifier to ev- 
ery principal that ever existed. Unfortunately, most Unix 
kernel data structures and applications only allocate space 
for a 32-bit user ID value, and an adversary can easily 
force a system to allocate 2°* user IDs. To solve this 
problem, UserFS associates a 64-bit generation number 
with every allocated UID!, in order to distin guish between 
two principals that happen to have had the same 32-bit 
UID value at different times. The kernel ensures that gen- 
eration numbers are unique by always incrementing the 
generation number when the UID is deallocated. However,
as we just mentioned, there isn’t enough space to
store the generation number along with the user ID in 
every kernel data structure. UserFS deals with this on a 
case-by-case basis: 


Processes. UserFS assumes that the current UID of a 
process always corresponds to the latest generation num- 
ber for that UID. This is enforced by killing every process 
whose current UID has been deallocated. 


Open Ufiles. UserFS keeps track of the generation num- 
ber for each open file descriptor of a Ufile, and veri- 
fies that the generation number is current before pro- 
ceeding with any ioctl on that file descriptor (such as 
USERFS_IOC_SETUID). Once a UID has been reused, the 
current UID generation number is incremented, and left- 
over file descriptors for the old Ufile will be unusable. 
This ensures that a process that had privileges over a UID 
in the past cannot exercise those privileges once the UID 
is reused. 


¹ It would take an attacker thousands of years to allocate 2^64 UIDs,
even at a rate of 1 million UIDs per second.
Path name                      Role
/userfs/ctl                    Ufile for root (UID 0).
/userfs/1001/ctl               Ufile for user 1001 (parent UID 0).
/userfs/1001/5001/ctl          Ufile for user 5001 (allocated by parent UID 1001).
/userfs/1001/5001/5002/ctl     Ufile for user 5002 (allocated by parent UID 5001).
/userfs/1001/5003/ctl          Ufile for user 5003 (allocated by parent UID 1001).
/userfs/1002/ctl               Ufile for user 1002 (parent UID 0).
/userfs/5001                   Symbolic link to 1001/5001.
/userfs/5002                   Symbolic link to 1001/5001/5002.
/userfs/5003                   Symbolic link to 1001/5003.

Figure 1: An overview of the files exported via UserFS in a system with two traditional Unix accounts (UID 1001 and 1002), and three dynamically-allocated
accounts (5001, 5002, and 5003). Not shown are system UIDs that would likely be present on any system (users such as bin, nobody, etc.),
or directories that are implied by the ctl files. Each ctl file supports two ioctls: USERFS_IOC_SETUID and USERFS_IOC_ALLOC.


Setuid files. Setuid files are similar to a file descriptor 
for a Ufile, in the sense that they can be used to gain the 
privileges of a UID. To prevent a stale setuid file from 
being used to start a process with the same UID in the 
future, UserFS keeps track of the file owner’s UID gener- 
ation number for every setuid file in that file’s extended 
attributes. (Extended attributes are supported by many file 
systems, including ext2, ext3, and ext4. Moreover, small 
extended attributes, such as our generation number, are 
often stored in the inode itself, avoiding additional seeks 
in the common case.) UserFS sets the generation number 
attribute when the file is marked setuid, or when its owner 
changes, and checks whether the generation number is 
still current when the setuid file is executed. 


Non-setuid files, directories, and other resources. 
UserFS does not keep track of generation numbers for the 
UID owners of files, directories, system V semaphores, 
and so on. The assumption is that it’s the previous UID 
owner’s responsibility to get rid of any data or resources 
they do not want to be accessed by the next process that 
gets the same UID value. This is potentially risky, if sen- 
sitive data has been left on disk by some process, but is 
the best we have been able to do without changing large 
parts of the kernel. 

There are several ways of addressing the problem of 
leftover files, which may be adopted in the future. First, 
the on-disk inode could be changed to keep track of the 
generation number along with the UID for each file. This 
approach would require significant changes to the ker- 
nel and file system, and would impose a minor runtime 
performance overhead for all file accesses. Second, the 
file system could be scanned to find orphaned files, much 
in the same way that UserFS scans the process table to 
kill processes running with a deallocated UID. This ap- 
proach would make user deallocation expensive, although 
it would not require modifying the file system itself. Fi- 
nally, each application could run sensitive processes with 
write access to only a limited set of directories, which can 
be garbage-collected by the application when it deletes 
the UID. Since none of the approaches are fully satisfactory,
our design leaves the problem to the application,
out of concern that imposing any performance overheads 
or extensive kernel changes would preclude the use of 
UserFS altogether. 
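
The generation-number checks for open Ufiles and setuid files can be summarized in a few lines. The following is a minimal sketch under stated assumptions, not the kernel implementation: GenerationTable, UfileHandle, and may_exec_setuid are hypothetical names, and a single shared counter stands in for the kernel's generation bookkeeping.

```python
class GenerationTable:
    """Toy model: the current generation number of each allocated UID."""

    def __init__(self):
        self.current = {}   # uid -> generation of the current principal
        self.next_gen = 1   # assumption: one counter shared by all UIDs

    def allocate(self, uid):
        self.current[uid] = self.next_gen
        self.next_gen += 1
        return self.current[uid]

    def deallocate(self, uid):
        del self.current[uid]
        self.next_gen += 1

    def is_current(self, uid, gen):
        return self.current.get(uid) == gen


class UfileHandle:
    """Models an open Ufile descriptor: remembers the generation at open."""

    def __init__(self, table, uid):
        self.table = table
        self.uid = uid
        self.gen = table.current[uid]

    def setuid(self):
        """Models USERFS_IOC_SETUID: refuse once the UID has been reused."""
        if not self.table.is_current(self.uid, self.gen):
            raise PermissionError("stale Ufile descriptor")
        return self.uid


def may_exec_setuid(table, owner_uid, stored_gen):
    """The setuid bit is honored only if the stored generation is current.

    stored_gen stands for the value kept in the file's extended attributes.
    """
    return table.is_current(owner_uid, stored_gen)
```

In this model, a handle or setuid file created before a UID is deallocated and reallocated fails both checks afterwards, because the reallocated UID carries a different generation number.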


3.1.4 Persistence 


UserFS must maintain two pieces of persistent state. First, 
UserFS must make sure that generation numbers are not 
reused across reboot; otherwise an attacker could use a 
setuid file to gain another application’s privileges when 
a UID is reused with the same generation number. One 
way to achieve this would be to keep track of the last 
generation number for each UID; however this would 
be costly to store. Instead, UserFS maintains generation 
numbers only for allocated UIDs, and just one “next” 
generation number representing all un-allocated UIDs. 
UserFS increments this next generation number when any 
UID is allocated or deallocated, and uses its current value 
when a new UID is allocated. To ensure that generation
numbers are not reused in the case of a system crash, 
UserFS synchronously increments the next generation 
number on disk. As an important optimization, UserFS 
batches on-disk increments in groups of 1,000 (i.e., it only
updates the on-disk next generation number after 1,000
increments), and it always increments the next generation 
counter by 1,000 on startup to account for possibly-lost 
increments. 
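
The batching scheme can be sketched as follows. This is an illustration under assumptions rather than the UserFS code: the class name GenCounter is hypothetical, a dictionary stands in for the on-disk value, and the sketch assumes that the value bumped at startup is itself written back synchronously before any generation number is handed out (otherwise two consecutive crashes could reuse numbers).

```python
BATCH = 1000  # on-disk value is refreshed once per BATCH increments


class GenCounter:
    """Crash-tolerant 'next generation number' with batched disk writes."""

    def __init__(self, disk):
        self.disk = disk  # dict standing in for persistent storage
        # Skip ahead to cover increments that may have been lost in a crash.
        self.next = disk["next_gen"] + BATCH
        disk["next_gen"] = self.next  # assumed synchronous write at startup
        self.unsynced = 0

    def take(self):
        """Return a fresh generation number and advance the counter."""
        gen = self.next
        self.next += 1
        self.unsynced += 1
        if self.unsynced == BATCH:
            self.disk["next_gen"] = self.next  # one write per BATCH increments
            self.unsynced = 0
        return gen
```

Every number handed out since the last write is smaller than the on-disk value plus 1,000, so restarting from that sum cannot repeat a number, at the cost of skipping up to 1,000 unused values per reboot.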

Second, UserFS must allow applications to keep using 
the same dynamically-allocated UIDs after reboot (e.g. 
if the file system contains data and/or setuid files owned 
by that UID). This involves keeping track of the genera- 
tion number and parent UID for every allocated UID, as 
well as the owner UID and GID for the corresponding 
Ufile. UserFS maintains a list of such records in a file 
(/etc/userfs_uid), as shown in Figure 2. The permis- 
sions for the Ufile are stored as part of the owner value (if 
the owner UID or GID is zero, the corresponding permissions
are 0, and if the owner UID or GID is non-zero, the
corresponding permissions are read-write). The genera- 
tion numbers of the parent UID, owner UID, and owner 
GID are not tracked; the parent UID is necessarily current
(otherwise this child would have been deallocated), and 
the owner UID and GID are left up to the Ufile owner. 

UserFS lazily updates this on-disk data structure; dele- 
tion is implemented in-place by setting the UID value to 
-1. If an application wants to rely on the Ufile being
present after reboot, it can force that Ufile’s persistent 
record to be written to disk by issuing an fsync on the 
Ufile’s file descriptor. 

As an optimization, UserFS also allows non-persistent 
UIDs to be allocated (for isolating processes that do not 
store any persistent data in the file system under their 
UID). To implement this, the USERFS_IOC_ALLOC ioctl 
takes one argument that indicates whether the new UID 
should be persistent or not; persistent UIDs can only be 
allocated to persistent parents. 

As a practical matter, UserFS partitions the 32-bit UID
space into UIDs reserved for system use (0 through 2^30 - 1),
persistent dynamically-allocated UIDs (2^30 through
2^31 - 1), non-persistent dynamically-allocated UIDs (2^31
through 2^31 + 2^30 - 1), and more reserved UIDs (2^31 + 2^30
through 2^32 - 1). This makes it easy to determine whether
a particular UID is persistent, and avoids conflicts with 
most system-allocated UIDs at either end of the UID 
number space. UserFS provides modified adduser and 
deluser programs that create and delete Ufiles when 
they add or remove users from the system (to allow those 
users to allocate new UIDs via ioctls on their Ufile), and 
assumes that the system administrator will not use UIDs 
in the dynamically-allocated range. 
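
Determining the class of a UID then reduces to a few comparisons. The sketch below assumes the four ranges are equal quarters of the 32-bit space (boundaries at 2^30, 2^31, and 2^31 + 2^30); the function name classify_uid is hypothetical.

```python
def classify_uid(uid):
    """Map a 32-bit UID to its range in the assumed UserFS partitioning."""
    if not 0 <= uid < 2**32:
        raise ValueError("UID must fit in 32 bits")
    if uid < 2**30:
        return "system"            # 0 .. 2^30 - 1
    if uid < 2**31:
        return "persistent"        # 2^30 .. 2^31 - 1
    if uid < 2**31 + 2**30:
        return "non-persistent"    # 2^31 .. 2^31 + 2^30 - 1
    return "reserved"              # 2^31 + 2^30 .. 2^32 - 1
```

Traditional accounts such as UID 1001 fall in the system range, and UIDs near the top of the space (which some systems assign to special accounts) fall in the upper reserved range.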


3.2 Restriction mechanisms 


To prevent malicious code from accessing resources that 
are accessible to everyone by default (such as /tmp or the 
network), UserFS allows applications to take advantage of 
existing restriction mechanisms: chroot to limit access 
to the file system namespace, and firewall rules to limit 
access to the network. 


3.2.1 File system namespaces 


To prevent processes from accessing files that are accessi- 
ble by default, UserFS allows any user to invoke chroot. 
There are two potential problems associated with this: 
setuid programs that will behave incorrectly in a chroot 
environment, and arbitrary programs attempting to escape 
from a chroot jail by recursive use of chroot itself. 


Setuid programs. If a setuid program runs in a chroot 
environment, it can behave in unpredictable ways—for in- 
stance, a setuid-root su program may read a user-supplied 
/etc/passwd file and grant the caller root access be- 
cause it assumed that root’s password in its version of 
/etc/passwd was authentic. UserFS relies on the user 
ID hierarchy to address this problem. In particular, after 
user U calls chroot, UserFS will only honor setuid bits 
for files owned by UIDs that are descendants of U. In
the corner case of root invoking chroot, every user is a 
descendant of root, and thus every setuid program will 
still be honored, as on a regular Linux system. 

UserFS only keeps track of the last UID to call chroot 
for a given process (inherited across fork). If one user 
performs chroot inside a second user’s jail, it is the re- 
sponsibility of the first user to verify that it’s creating a 
chroot environment acceptable to all of its descendants. 
In practice, we expect that the first user will be a descen- 
dant of the second user (because he is executing inside 
the second user’s jail), so this requirement will not pose 
significant problems. 
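
The setuid rule amounts to an ancestry check in the UID hierarchy. The sketch below is illustrative only: the function names and the parent_of mapping are hypothetical, and it assumes that files owned by U itself also count, which is consistent with the root corner case but is not stated explicitly.

```python
def is_descendant(parent_of, uid, ancestor):
    """True if ancestor appears strictly above uid in the UID hierarchy."""
    node = parent_of.get(uid)
    while node is not None:
        if node == ancestor:
            return True
        node = parent_of.get(node)
    return False


def honor_setuid(parent_of, file_owner, chroot_uid):
    """Decide whether a setuid bit is honored for the calling process.

    chroot_uid is the last UID that called chroot for this process,
    or None if no chroot has taken place.
    """
    if chroot_uid is None:
        return True
    # Assumption: the chroot caller's own files are honored as well.
    return file_owner == chroot_uid or is_descendant(
        parent_of, file_owner, chroot_uid)
```

Under this model, a setuid-root su inside a jail created by user 1001 is not honored, while setuid files owned by 1001's dynamically-allocated children still are.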


Escaping chroot. The Linux chroot mechanism 
works by effectively maintaining a single “barrier” at 
the specified root directory that prevents the process from 
evaluating .. (parent directory) of that process’s root di- 
rectory. A process can escape a chroot jail by obtaining 
a reference (either a file descriptor or current working 
directory) to a directory outside the chroot’ed hierarchy, 
and using that reference to walk up the .. pointers to the
true file system root. Even if an application properly uses 
chroot to confine a process, the kernel only keeps track 
of one root directory pointer per process, so a malicious 
process in a chroot jail could confine itself to a second 
chroot jail while maintaining a handle on a directory 
outside this second jail, and use that handle to escape 
both jails. 

To prevent this problem, UserFS enforces three rules 
for chroot invoked by non-root users. First, to ensure 
a process cannot maintain a current working directory 
outside the chroot environment, UserFS requires that 
chroot callers set their directory to the chroot target 
directory ahead of time. Second, UserFS checks that a 
process calling chroot has no open directory file descrip- 
tors. Finally, UserFS ensures that a process cannot receive 
a directory file descriptor via file descriptor passing from 
outside the jail: it annotates Unix domain sockets with the 
sender’s root directory (or a “prohibited” value if there 
are senders with different root directories) on sendmsg, 
and checks that the sender’s root directory matches the re- 
cipient process root directory on recvmsg, if the message 
contains a directory file descriptor. 
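
The three rules can be modeled compactly. This is a simplified sketch, not the kernel logic: the names may_chroot and UnixSocketLabel are hypothetical, and directories are compared as normalized path strings, whereas a kernel would compare directory objects.

```python
import posixpath

PROHIBITED = object()  # marks a socket whose senders had different roots


def may_chroot(cwd, target, open_dir_fds):
    """Rules 1 and 2: the working directory must already be the target,
    and the caller must hold no open directory file descriptors."""
    same_dir = posixpath.normpath(cwd) == posixpath.normpath(target)
    return same_dir and len(open_dir_fds) == 0


class UnixSocketLabel:
    """Rule 3: annotation of a Unix domain socket with the sender's root."""

    def __init__(self):
        self.sender_root = None

    def on_sendmsg(self, sender_root):
        if self.sender_root is None:
            self.sender_root = sender_root
        elif self.sender_root != sender_root:
            self.sender_root = PROHIBITED

    def may_receive_dir_fd(self, recipient_root):
        """Checked on recvmsg when the message carries a directory FD."""
        return (self.sender_root is not PROHIBITED
                and self.sender_root == recipient_root)
```

A socket that has been written to from two different root directories can no longer deliver directory file descriptors to anyone in this model, which matches the "prohibited" annotation described above.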


3.2.2 Firewall rules 


Ideally, we would like users to be able to run a process 
with a set of firewall rules attached to it, and for those 
firewall rules to apply to any child processes spawned by 
that process, much in the same way that chroot applies 
to all child processes. Unfortunately, this would require 
changing the core Linux kernel: at the very least, it would 
be necessary to track the “current firewall ruleset” for each
process. Since we wanted to implement UserFS purely 
in terms of loadable kernel modules, we compromised,
and associated firewall rules with UIDs instead. The
kernel already keeps track of the UID for each process,
and propagates the UID to the children of that process,
so UserFS simply needs to ensure that firewall rules for
newly-allocated UIDs inherit the firewall rules for the
parent UID.


Figure 2: Record stored by UserFS on disk for each allocated UID, totaling 24 bytes per allocated UID.


UserFS’s firewall system consists of rules, which form 
rulesets, which are in turn associated with UIDs. At the 
lowest level, rules are of the form (action, proto, address, 
netmask, port). Our prototype supports two kinds of 
actions, ALLOW and BLOCK, and two protocols, TCP 
and UDP. The protocol, address, netmask, and port are 
matched against the destination of outgoing packets or the 
source of incoming packets; port value 0 matches any port. 
Supporting just TCP and UDP protocols suffices because, 
on Linux, a non-root process cannot open a raw socket 
to send arbitrary packets that are neither TCP nor UDP.
For kernels that support other protocols, such as SCTP, 
UserFS’s rules could be augmented to track additional 
protocols. 


A ruleset is an ordered sequence of rules, used to de- 
termine whether a packet should be allowed or blocked. 
When checking a packet against a ruleset, UserFS finds 
the earliest rule in the ruleset that matches the packet, and 
uses that rule’s action to determine if the packet should 
be allowed or blocked. Each ruleset contains two implicit 
rules at the end, (ALLOW, TCP, 0.0.0.0, 0.0.0.0, 0) and 
(ALLOW, UDP, 0.0.0.0, 0.0.0.0, 0), which allow any pack- 
ets by default. Each UID is associated with a ruleset, and 
applications can modify that UID’s ruleset by adding or 
removing rules as necessary. 


One potential worry in associating rulesets with a UID 
is that a malicious process can create a child UID with 
less-restrictive firewall rules. To mitigate this problem, 
UserFS checks not only the UID’s own firewall ruleset, 
but also the rulesets of all parent UIDs, and only allows 
packets if they are allowed by every ruleset in this chain. 
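
Rule matching, first-match evaluation, and the check along the chain of parent UIDs can be sketched as follows. This is an illustrative model, not the netfilter-based implementation: the function names and the parent_of and rulesets mappings are hypothetical, and only IPv4 addresses are handled.

```python
import ipaddress

ALLOW, BLOCK = "ALLOW", "BLOCK"

# Implicit rules at the end of every ruleset: allow everything by default.
IMPLICIT = [(ALLOW, "TCP", "0.0.0.0", "0.0.0.0", 0),
            (ALLOW, "UDP", "0.0.0.0", "0.0.0.0", 0)]


def _ip(text):
    return int(ipaddress.IPv4Address(text))


def rule_matches(rule, proto, addr, port):
    """Match (action, proto, address, netmask, port) against a packet."""
    _action, r_proto, r_addr, r_mask, r_port = rule
    mask = _ip(r_mask)
    return (r_proto == proto
            and (_ip(addr) & mask) == (_ip(r_addr) & mask)
            and r_port in (0, port))  # port 0 matches any port


def ruleset_allows(ruleset, proto, addr, port):
    """The earliest matching rule decides; implicit rules come last."""
    for rule in list(ruleset) + IMPLICIT:
        if rule_matches(rule, proto, addr, port):
            return rule[0] == ALLOW
    return False  # only reachable for protocols other than TCP and UDP


def chain_allows(uid, parent_of, rulesets, proto, addr, port):
    """A packet passes only if every ruleset up to the root allows it."""
    while uid is not None:
        if not ruleset_allows(rulesets.get(uid, []), proto, addr, port):
            return False
        uid = parent_of.get(uid)
    return True
```

In this model a child UID that adds an ALLOW rule for a network blocked by its parent gains nothing, since the parent's BLOCK rule is still consulted for every packet.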


UserFS provides a Ufile ioctl to add or remove rules 
from that UID’s firewall ruleset. However, there is a slight 
complication: on the one hand, we want to ensure that 
a process cannot modify its own firewall ruleset, but on 
the other hand, a process can always open its own Ufile. 
To address this problem, UserFS allows the firewall ioctl 
to be invoked only by the parent UID of a Ufile. This 
ensures that a process cannot change firewall rules for 
itself through its own Ufile. 
4 IMPLEMENTATION


We have implemented UserFS as a kernel module for 
version 2.6.31 of the Linux kernel. The UserFS kernel 
module comprises a little less than 3,000 lines of code, 
excluding unit tests and the user-space mount.userfs
command. UserFS relies heavily on the LSM framework [44]
for checking generation numbers on setuid
files (using file_permission and inode_setattr hooks),
for confining chroot processes (using socket_sendmsg 
and socket_recvmsg hooks), and on netfilter for imple- 
menting network filtering (using NF_INET_LOCAL_IN and 
NF_INET_LOCAL_OUT hooks). UserFS also adds support 
to allow a process to chown or chgrp files between dif- 
ferent UIDs that the process has privileges over. 

Because UserFS is implemented as a kernel module, 
and does not modify core kernel code, it makes some 
trade-offs. For example, the kernel’s versions of chown, 
chgrp, and chroot are not flexible enough for UserFS 
to implement its desired security policy from a kernel 
module. As a workaround, UserFS provides ioctls that 
implement equivalent functionality with its own secu- 
rity policy. Integrating UserFS into the core kernel code 
would both simplify our implementation and offer a more 
coherent interface to applications. 

We have also implemented helper libraries for applica- 
tions using UserFS, for both C and PHP. The C library 
comprises about 1,500 lines of code, including functions 
to execute a program in a newly-allocated jail and under a 
fresh user ID, to fork with a new UID, and to manipulate 
user IDs. The C library is careful to open all Ufiles with 
the O_CLOEXEC flag to avoid accidentally leaking Ufile 
file descriptors to other processes. The PHP library adds 
about 600 more lines on top of the C library to allow PHP 
applications to manipulate Ufiles. 


5 APPLYING USERFS 


To illustrate how UserFS would be used in practice, we 
modified several applications to take advantage of UserFS, 
including the Chromium web browser, the DokuWiki 
web application, Unix command-line utilities, and an 
FTP server. The rest of this section reports on these 
applications, focusing on the changes we had to make to 
each application in order to use UserFS, and the resulting 
benefits from doing so. 


5.1 DokuWiki 


Many web applications implement their own protection 
mechanisms, since they do not typically run as root, and 
thus cannot allocate user IDs for each application-level 
user. This can lead to vulnerabilities if the application developers
make a mistake in performing security checks [9].
To show how UserFS can prevent similar problems, we 
modified DokuWiki [10], a wiki application written in 
PHP that supports read-protected and write-protected 
pages [11] and that stores wiki pages in the server’s file 
system, to enforce the protection of wiki pages using file 
system permissions. 


Our modified version of DokuWiki allocates a separate 
UID for each wiki user, and sets Unix permissions on 
wiki page files to reflect the protection of that page (we 
use ACL support in the ext4 file system [19] to repre- 
sent ACLs that involve multiple users). To minimize the 
amount of damage that an attacker can do, our modified 
version of DokuWiki executes each HTTP request in a 
separate process, and allocates a new ephemeral user ID 
for the initial processing of each request². If an HTTP
request provides the correct password for a user account, 
the DokuWiki PHP process handling that request can ob- 
tain a file descriptor for that user’s Ufile, and change its 
UID to that user, by using the UserFS PHP module. This 
in turn allows a DokuWiki process to read or write wiki 
pages accessible to that user. Figure 3 shows the flow of 
an HTTP request in our modified DokuWiki. 


One of the key parts of our modified DokuWiki is the 
login mechanism, which allows the DokuWiki process 
to obtain a file descriptor to a user’s Ufile if it knows 
the user’s password. We implemented this mechanism 
in a short C program called dokusu. dokusu accepts a 
username and password on stdin, checks the username 
and password against the password database, and if the 
password matches, it opens the corresponding user’s Ufile 
(listed in the password database) and uses file descriptor 
passing to pass it back to the caller via stdout (which 
the caller should have set up as a Unix domain socket). 
dokusu is typically installed as a setuid program with the 
administrator’s UID, and the permissions on all Ufiles 
for DokuWiki users in /userfs and on the password 
database are such that only the administrator can access 
them. Thus, to authenticate, DokuWiki spawns dokusu, 
passes it the username and password from the HTTP 
request, and waits for a Ufile in response. 
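
The file descriptor passing at the heart of this login mechanism can be demonstrated in a few lines. The sketch below is not dokusu: the function names are hypothetical, a plain dictionary stands in for the password database, a single Unix domain socket carries both the request and the reply (dokusu uses stdin and stdout), and any readable path can stand in for a Ufile. It relies on socket.send_fds and socket.recv_fds, available on Unix in Python 3.9 and later.

```python
import os
import socket


def agent(sock, passwords, ufile_paths):
    """Stand-in for the authentication agent: check a password and,
    on success, pass an open file descriptor back to the caller."""
    user, password = sock.recv(4096).decode().split("\n", 1)
    if passwords.get(user) == password:
        fd = os.open(ufile_paths[user], os.O_RDONLY | os.O_CLOEXEC)
        socket.send_fds(sock, [b"ok"], [fd])
        os.close(fd)  # the receiver now holds its own copy
    else:
        sock.sendall(b"no")


def request_login(sock, user, password):
    """Caller side, step 1: send the credentials."""
    sock.sendall(f"{user}\n{password}".encode())


def receive_ufile(sock):
    """Caller side, step 2: return the passed descriptor, or None."""
    data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0] if data == b"ok" and fds else None
```

With a pair created by socket.socketpair(), the caller invokes request_login on one end, the agent runs on the other end, and receive_ufile then yields a descriptor only when the password matched. In UserFS the caller would next issue USERFS_IOC_SETUID on that descriptor.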


DokuWiki keeps a copy of the user’s password in its 
HTTP cookie, which makes it easy to authenticate sub- 
sequent requests. Cookies that store a session ID could 
also be supported, by augmenting dokusu to keep track 
of all currently valid session IDs and the corresponding 
user IDs for each session, and to accept a valid session ID 
as credentials for the corresponding user. 


² We changed the first line of DokuWiki’s PHP files to allocate a
new ephemeral UID for each request, and to switch to that user ID. An 
alternative approach would be to modify the web server to launch each 
CGI script under a fresh user ID. 
Making these changes to DokuWiki involved adding
approximately 80 lines of PHP code, and implementing
the 160-line dokusu program, on top of our UserFS PHP
and C libraries, respectively. These changes allow the ker- 
nel to enforce DokuWiki’s security policy, and Section 6.2 
shows the effectiveness of this technique. 


5.2 Command-line tools 


To make it easy for ordinary users to use UserFS, we 
implemented a command to allocate a new user ID, called 
ualloc, which simply issues USERFS_IOC_ALLOC on the 
Ufile of the current process UID and prints the resulting 
UID value. To allow users to run code with these newly 
allocated UIDs, we modified su to allow users to be spec- 
ified by their Ufile pathname instead of by username (in 
which case su relies on Ufile permissions to check if the 
caller is allowed to run as the target user, since it has no 
way of authenticating UserFS users by password). These 
modifications comprised approximately 300 lines of code. 

With these changes, users can easily run arbitrary Unix 
applications with fewer privileges. For example, if a 
user wants to run a peer-to-peer file sharing program, but 
wants to avoid the risk of that program sharing private 
files with the rest of the world, the user can simply run 
ualloc to create a fresh UID for that program, run su 
/userfs/newuid/ctl to open a shell running as that 
user ID, and run the file sharing program from that shell. 
The file sharing program will not be able to read any of the 
user’s private files (i.e., files that are not world-readable).

Users can also create processes that are isolated from 
the user’s own account. For instance, ssh-agent stores 
a decrypted version of the user’s SSH private key in mem- 
ory. If an attacker compromises the user’s account and 
finds a running ssh-agent process, the attacker can ex- 
tract the key from memory by debugging ssh-agent. 
To prevent this, a user can allocate a fresh user ID 
with ualloc, run ssh-agent as that user ID, change 
permissions on the agent’s socket so that the user can 
talk to ssh-agent³, and finally change the owner of
ssh-agent’s Ufile to ssh-agent’s UID, so that the user 
can no longer access it. The only thing the user can do 
at this point is to communicate with ssh-agent via the 
socket, or kill ssh-agent by deallocating the UID. The 
user cannot access ssh-agent’s memory to extract the 
key, since ssh-agent is running under a different UID, 
and the user cannot gain that UID’s privileges, because it 
cannot open the corresponding Ufile. 

Finally, UserFS makes it easier for users to switch user 
IDs. With traditional su, the user receives a new shell run- 
ning under the target UID, with a new working directory, 
new command history, and new environment variables. 
When the user wants to switch back to their original UID,
they again lose their command history and environment
variables. To show how UserFS can help, we modified
su to support an option to pass the resulting Ufile back
to the caller via FD passing, instead of running a shell
under the resulting user’s UID, and likewise modified
bash to accept the Ufile FD from su (much like the design
of dokusu in the previous subsection) and invoke
USERFS_IOC_SETUID on it. This allows the user to switch
UIDs without having to switch shell processes, improving
user convenience.


³ We had to make a two-line change to ssh-agent to support this,
since by default ssh-agent refuses connections from other UIDs.


Figure 3: Flow of an HTTP request in our modified version of DokuWiki, showing Alice trying to write to two protected pages. Bold labels show
process names (httpd, php, and dokusu). Italic labels show process UIDs (www-data, anonymous, admin, and 5009). After reading the users file,
dokusu checks the supplied password against the stored password. In this example, Alice can modify page 1 (to which she has read-write access), but
cannot modify page 2 (to which she has read-only access). In practice, Alice’s UID would be a value between 2^30 and 2^31 - 1, instead of 5009.


5.3 User authentication


Many network services run as root in order to authenti- 
cate users and to invoke setuid to switch to that user’s 
UID afterwards. Unfortunately, these network services 
are also some of the most vulnerable components in a 
system, since they are directly exposed to an attacker’s 
inputs from the network, and if they are compromised, 
the attacker gains root access. With UserFS, network 
services like ftp, ssh, telnet, or IMAP mail servers can 
instead run as completely unprivileged processes⁴, and
perform authentication and login via Unix domain sockets
like in DokuWiki above. (In fact, they can reuse the
su command from the previous subsection, which passes
back the authenticated user’s Ufile to the caller.) This ensures
that if an attacker finds a vulnerability in a network
service, they get almost no privileges on the system. To
prevent an attacker from subverting subsequent connections
to a compromised service, a new service process
should be forked, with a fresh non-persistent UID, for
each connection.


⁴ We provide setuid-root binaries to open specific TCP ports below
1024, such as port 80 for the web server, accessible only to the web
server’s UID.

To show this is feasible, we modified the Linux NetKit 
FTP server [22] to authenticate users using Ufile passing; 
doing this required 50 lines of code, indicating that it is rel- 
atively easy to make such changes to existing applications 
(unlike privilege separation in the style of OpenSSH [39], 
which is much more invasive). Our modified FTP server 
uses the su program as its authentication agent. 


5.4 Chromium browser 


One application that is already broken up into many processes
is Google’s Chromium browser [2], which maintains
a separate process for rendering each browser window,
and a single browser kernel process responsible for
coordinating with the rendering processes. This architec- 
ture easily lends itself to privilege separation, by isolating 
each rendering process. Indeed, Chromium already tries 
to do this on Windows using tokens [17], although this 
does not prevent a compromised browser process from 
accessing the network or world-accessible files. 

With UserFS, browser processes can be isolated by 
allocating a fresh non-persistent UID for each render- 
ing process, chrooting the rendering process into an 
empty directory, and setting up firewall rules that block
all network traffic. Making these changes to Chromium
required replacing the fork call in Chromium with a call
to a UserFS library function called ufork that performs
precisely the actions mentioned above⁵. All communication
between the browser kernel process and the rendering
processes happens via sockets, which remain intact, while 
the kernel’s protection mechanisms ensure that a compro- 
mised rendering process cannot access any files, signal 
any processes, or use the network. 


6 EVALUATION 


To evaluate UserFS, we first discuss its security, then 
show how UserFS helps prevent attackers from exploit- 
ing vulnerabilities in DokuWiki, and then measure the 
performance overheads associated with UserFS. 


6.1 Kernel security 


The goal of UserFS is to allow any application to use 
the kernel’s protection mechanisms. This implicitly as- 
sumes that the kernel’s mechanisms are secure. While 
security vulnerabilities are found in the kernel from time 
to time [1], this paper does not attempt to tackle this 
problem, and assumes that, for the time being, users will 
continue to run applications on the Linux kernel. 

Thus, we mostly focus on the security of any changes 
that UserFS makes to the Linux kernel. As a first-order 
measure, UserFS is relatively small—less than 3,000 lines 
of code—which simplifies the job of auditing our code. 
The specific mechanisms that UserFS provides that could 
be misused by adversaries are the USERFS_IOC_SETUID 
ioctl, allowing a process to switch user IDs, and the 
chroot mechanism that allows non-root processes to 
change their root directory. 

We believe the USERFS_IOC_SETUID mechanism is se- 
cure because it only allows a process to switch user IDs 
if it has an open file descriptor to the corresponding Ufile. 
By default, each standard user’s Ufile can only be opened 
by that user (and by root), making it no different from the 
current kernel policy. Users can change permissions on 
Ufiles to allow other processes to open them, but again, 
a process can only change permissions on a Ufile that
it already has access to (i.e., it was initially its UID,
or it was granted to it). Applications can potentially
make mistakes and leak privileges over a Ufile to another 
process by forgetting to close a Ufile file descriptor. The 
UserFS library tries to mitigate this by opening all Ufiles 
with the O_CLOEXEC flag. 

The chroot mechanism could potentially be used re- 
cursively by an adversary to escape from a chroot jail. We 
believe that we have implemented sufficient safeguards
against this, as described in Section 3.2.1, but we have no
formal proof of their correctness.


⁵ We do not provide a more fine-grained lines-of-code measure for
the ufork function because it internally relies on most of the other
functions provided by the UserFS library.


6.2 Application security 


Assuming UserFS and the Linux kernel are secure, we 
wanted to show what security benefits applications could 
extract from this. To do so, we decided to check whether 
any previously-reported vulnerabilities for DokuWiki 
would have been prevented by our changes to enforce 
the DokuWiki security policy using file system permis- 
sions. We found several vulnerabilities for DokuWiki in 
the past few years that allowed an attacker to compromise 
DokuWiki [32-37] (as opposed to information disclosure 
vulnerabilities, such as printing PHP debug information, 
which might help an attacker in exploiting another attack 
vector). 

Our modified version of DokuWiki (backported to an 
older version of DokuWiki that contained the above vul- 
nerabilities) was able to prevent exploits of code injec- 
tion [35-37], directory traversal [33], and insufficient 
permission check [34] vulnerabilities (5 out of 6), but did 
not prevent exploits of a cross-site request forgery vulner- 
ability [32]. Although our modified version of DokuWiki 
contained all of the above vulnerabilities, the vulnerable 
code was running with limited privileges (either the web 
server’s ephemeral per-request UID, or the UID of a spe- 
cific wiki user), which prevented the attack from doing 
any server-side damage. 


6.3 Performance 


Performance of applications running on Linux with 
UserFS depends on two factors: overheads imposed 
by UserFS on system calls, and overheads associated 
with privilege-separating the application to make use of 
UserFS. In most cases, UserFS imposes no overheads 
on system calls, because the kernel executes the same 
exact access control checks based on UIDs with or with- 
out UserFS. One exception to this is the invocation of 
setuid binaries, for which UserFS checks the generation 
number of the setuid binary against the latest generation 
number for that UID. Applications that are modified to 
take advantage of UserFS incur two additional sources of 
overhead: the cost to invoke UserFS mechanisms, such as 
ioctls to allocate or change UIDs, and the cost of privilege- 
separating the application into separate Unix processes. 
To evaluate these three sources of overhead, we used 
microbenchmarks to measure the cost of system calls af- 
fected by UserFS, and we used DokuWiki to measure the 
cost of privilege-separating an application with UserFS. 
Figure 4 shows the results of these experiments on a
2.8GHz Intel Core i7 system with 8GB RAM running
a 64-bit Linux 2.6.31 kernel. As can be seen from the
figure, UserFS imposes minimal overheads for both user
allocation and for checking generation numbers on setuid
binaries (which is dwarfed by the cost of forking a setuid
program in the first place). In the case of DokuWiki, the
performance overhead of privilege separation is largely
dominated by the cost of spawning the dokusu authen-
tication agent; we expect that having a long-running au-
thentication agent that accepts requests over Unix domain
sockets would significantly reduce the cost of running
DokuWiki with UserFS. However, the costs of privilege-
separation are not specific to UserFS, and have been stud-
ied before extensively [2, 3, 5-7, 24, 26, 39].


Operation | Time without UserFS | Time with UserFS
Allocate UID | - | 0.022 ms
Check generation number of setuid executable | 0 | 0.003 ms
Run sudo ls | 10.943 ms | 10.946 ms
Fetch page from DokuWiki | 45 ms | 61 ms

Figure 4: Time taken to perform several operations with and without UserFS.
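
As a rough check on the measurements in Figure 4, the absolute and relative overheads can be worked out directly. The short Python sketch below is not part of UserFS; the function name `overhead` and the dictionary layout are illustrative assumptions, and the input numbers are the ones reported in Figure 4.

```python
# Timings (in milliseconds) copied from Figure 4: (without UserFS, with UserFS).
measurements_ms = {
    "Run sudo ls": (10.943, 10.946),
    "Fetch page from DokuWiki": (45.0, 61.0),
}


def overhead(without_ms, with_ms):
    """Return (absolute overhead in ms, relative overhead in percent)."""
    absolute = with_ms - without_ms
    relative_percent = 100.0 * absolute / without_ms
    return absolute, relative_percent


for name, (base, userfs) in measurements_ms.items():
    absolute, percent = overhead(base, userfs)
    print(f"{name}: +{absolute:.3f} ms ({percent:.2f}%)")
```

With these inputs, the setuid generation check adds about 0.003 ms to sudo ls (roughly 0.03%), while the privilege-separated DokuWiki page fetch costs 16 ms more (roughly 36%), which matches the observation that spawning the authentication agent dominates.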


7 RELATED WORK 


The principle of least privilege [40] is generally recog- 
nized as a good strategy for building secure systems, and 
has been used by many applications in practice, including 
qmail [3], OpenSSH [39], OKWS [24], a number of web 
browsers [2, 18, 41], and others. Current Unix protection 
mechanisms make it difficult for non-root applications 
to follow the principle of least privilege, by not allowing 
them to create less-privileged principals. This requires
developers who want fewer privileges to actually have more
privileges by running as root, and UserFS directly ad-
dresses this problem.

It is well-known that reasoning about the safety of a 
computer system in the presence of setuid programs is 
difficult [21, 27], and there are many pitfalls in imple- 
menting safe setuid programs [4, 8]. At the lowest level, 
UserFS does not make it any easier to write a correct 
setuid program. However, we hope that UserFS makes it 
possible for programs that currently run as root, including 
setuid-root programs, to run under a less privileged UID 
instead, mitigating the damage from any vulnerability. 

Krohn argued that applications must be given mecha- 
nisms to reduce their privileges [25], and ServiceOS [42] 
similarly argues for support for application-level prin- 
cipals in the OS kernel. Capability-based systems like 
KeyKOS [6, 20], and DIFC systems like Asbestos [12] 
and HiStar [46], allow users to create new protection do- 
mains of their own, at the cost of requiring a new OS 
kernel. Flume [26] shows how these ideas can be im-
plemented on top of a Linux kernel to avoid the cost of
re-implementing a new OS kernel, but Flume does not 
allow users to apply its protection mechanisms to unmod- 
ified existing applications. UserFS shows how the idea 
of egalitarian protection mechanisms can be realized in 
a standard Linux kernel, in a way that cleanly applies 
to most existing applications, and achieves many of the 
goals suggested by Krohn [25] and Wang [42].

The use of Ufile file descriptors to represent privileges
over UIDs is inspired by capability systems [28]. Unlike 
traditional capability systems, which use capabilities to 
control access to all resources, UserFS only uses file 
descriptors to track the set of Ufiles currently held open 
by a process, and to pass Ufiles between processes. Initial 
access to Ufiles for opening the file descriptor, as well as 
access to all other resources, is controlled by Unix file 
permissions and other Unix mechanisms. One common 
problem facing capability systems is revocation of access. 
UserFS uses generation numbers to ensure that, once a 
UID has been reused, leftover file descriptors cannot gain 
access to that UID, since their generation numbers do not 
match the UID’s generation number. 
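
The revocation argument can be made concrete with a toy model. The Python sketch below is not the UserFS kernel code: `UidTable`, `UfileHandle`, and their methods are hypothetical names, and the model only assumes what the text states, namely that each reuse of a UID value gets a new generation number and that a leftover descriptor carries the generation it was opened with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UfileHandle:
    """Hypothetical stand-in for an open Ufile file descriptor."""
    uid: int
    generation: int


class UidTable:
    """Toy model of per-UID generation numbers."""

    def __init__(self):
        self._generation = {}    # uid -> latest generation number
        self._allocated = set()  # UIDs currently in use

    def allocate(self, uid):
        if uid in self._allocated:
            raise ValueError(f"UID {uid} is already allocated")
        # Every (re)allocation of the same UID value bumps its generation.
        self._generation[uid] = self._generation.get(uid, 0) + 1
        self._allocated.add(uid)
        return UfileHandle(uid, self._generation[uid])

    def deallocate(self, uid):
        self._allocated.discard(uid)

    def is_valid(self, handle):
        """A handle grants access only if its generation is the current one."""
        return (
            handle.uid in self._allocated
            and self._generation.get(handle.uid) == handle.generation
        )


table = UidTable()
old = table.allocate(70000)
table.deallocate(70000)
new = table.allocate(70000)  # same UID value, new generation
print(table.is_valid(old), table.is_valid(new))
```

In this model the stale handle `old` is rejected even though its UID value matches, because its generation number no longer equals the one recorded for that UID.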

Although current Unix protection mechanisms are not 
egalitarian, many systems have used them to achieve priv- 
ilege separation, at the cost of requiring some part of their 
system to run as root. For example, OKWS [24] shows
how to build a privilege-separated web server by running 
a launcher as root, and Android [16] similarly uses Linux 
user IDs to isolate different applications on a cell phone. 
If these platforms start running increasingly more com- 
plex applications inside them, those applications will not 
have the benefit of running as root and creating their own 
protection domains. UserFS would address this problem. 

Similarly, there have been a number of tools that help 
programmers privilege-separate their existing applica- 
tions [5, 7, 39]. The resulting privilege-separated applica- 
tions often require root privileges to actually set up protec- 
tion domains, and UserFS could be used in conjunction 
with these tools to run privilege-separated applications 
without root access. 

System call interposition [15] could, in principle, im- 
plement any policy that a kernel could implement. By 
relying on the kernel’s protection mechanisms, UserFS 
avoids some of the pitfalls associated with system call 
interposition [14] and avoids runtime overhead for most 
operations. More importantly, UserFS illustrates what 
interface could be used by applications to allocate and 
manage their protection domains and set policies; the 
same interface could be implemented by a system call 
interposition system. 

Bittau et al. [5] propose a new kernel abstraction called
an sthread that can execute certain pieces of an applica- 
tion’s code in isolation from the rest of that application. 
The key contribution of sthreads was in providing a mech- 
anism that has relatively low overhead for fine-grained 
isolation of process memory, and that can be used by any
processes in the system. UserFS, on the other hand, pro- 
vides persistent UIDs that can be used to control access to 
data in the file system, and to control interactions between 
multiple processes in an operating system. 

The Linux kernel supports several security mechanisms 
in addition to traditional user ID protection, such as 
SELinux [29] and Linux-vserver [38], but none of these 
mechanisms allow users to create their own protection do- 
mains and use them to protect system resources like files 
and devices. One protection mechanism that is available 
to users on Linux is running code in a virtual machine 
such as qemu. Unfortunately, this is often too coarse-
grained and heavy-weight for most applications.

Taint tracking in an operating system can be used to 
implement certain application-level security policies; for 
example, SubOS [23] shows how this can be implemented 
on OpenBSD. Unfortunately, these mechanisms are much 
more invasive and impose more runtime overhead than 
UserFS, which simply exposes existing mechanisms in 
the OS kernel. 

The protection mechanisms in Windows differ from 
those found in Unix systems. Windows protection is cen- 
tered around the notion of tokens [31]. Users can create 
tokens that grant almost no privileges, and this is used 
by applications such as Chromium to sandbox untrusted 
code [17]. However, there is no way to create tokens 
with a fresh user ID (without administrative privileges to 
create a new user), which makes it difficult to implement 
controlled sharing of system resources (as opposed to 
complete isolation in a sandbox). Windows tokens can be 
passed between processes, similar to how UserFS allows 
passing file descriptors for Ufiles. The Windows firewall 
allows associating firewall rules with executables. UserFS 
associates firewall rules with user IDs, and inherits fire- 
wall rules on user ID creation, which ensures that a user 
cannot escape firewall rules by creating and running a 
new executable. 


8 LIMITATIONS AND FUTURE WORK


While UserFS helps applications run code with fewer priv- 
ileges, it is not a panacea. Running untrusted code on a 
system often exposes a wider range of possibly-vulnerable 
interfaces than if we were simply interacting with the at- 
tacker over the network. For example, an attacker may 
try to exploit bugs in the kernel or in other applications 
running on the same machine. Nonetheless, if it is neces- 
sary to run untrusted or partially-trusted applications on a 
machine, UserFS helps improve security with respect to 
system resources. 

UserFS, much like Linux itself, currently assumes that 
all file systems are always mounted on the same machine, 
and does not have a plan for translating UIDs from a 
file system that was originally mounted on a different 
machine. One possible approach to dealing with this
problem may be to maintain a globally unique name of 
each UID (perhaps a public key), and to store on each file 
system a mapping table between file system UIDs and the 
globally unique names for those UIDs. 


When a user ID is deallocated, it may be difficult to 
remove non-empty directories owned by that UID in the 
file system without root’s intervention. While we have not 
yet implemented a solution to this problem, we imagine a 
system call or a setuid-root program that, upon request, 
recursively garbage-collects files or sub-directories owned 
by de-allocated UIDs from a given directory, as long as 
the caller has write permission on that directory. 


UserFS only protects resources managed by the oper- 
ating system, such as files, processes, and devices. Web 
applications often use databases to store their data, which 
UserFS cannot protect directly. In the future, we hope to 
explore the use of OS UIDs in a database to implement 
protection of data at a finer granularity (perhaps at the 
row level). 


Our current prototype allocates user IDs, but does 
not separately allocate group IDs. We believe it is best 
to have only one kind of dynamically allocated princi- 
pal, such as the 32-bit integer called the UID in UserFS. 
These principals can then be used to represent either users 
or groups, depending on the application’s requirements. 
The GID and grouplist associated with every Unix pro- 
cess could then be used to represent a process that has 
the privileges of multiple principals at once. To sup-
port this, UserFS could provide a USERFS_IOC_ADDGROUP
ioctl, which would add the Ufile’s UID to the grouplist 
of the calling process. To avoid conflicts with existing 
groups, this ioctl should be only allowed for dynamically- 
allocated UIDs. In terms of file permissions, we also 
believe that POSIX ACLs [19] are a better alternative to 
the Unix user-group-other permission bits. 


UserFS relies on the kernel to support 32-bit UIDs, as 
opposed to 16-bit UIDs from the original Unix design. 
Linux has supported 32-bit UIDs since kernel version 
2.3.39 (January 2000), but UserFS cannot support older 
file systems that can only keep track of a 16-bit UID, such 
as the original Minix filesystem. 


Our prototype faces several limitations because it is 
implemented as a loadable kernel module, and avoids 
making any extensive changes to the Linux kernel. For 
example, the chroot system call on Linux always rejects 
calls from non-root users, requiring UserFS to provide 
an alternative way of invoking chroot. Performing priv- 
ileged operations in the kernel also requires UserFS to 
sometimes change the current UID of the calling process. 
While we believe our prototype does so safely, being able 
to change permission checks inside the core kernel code 
would be both simpler and more secure in the long term.

If UserFS was integrated into the Linux kernel, we
would hope to extend our chroot mechanism to also 
allow arbitrary users to use the Linux file system names- 
pace mechanism (a generalization of the mount table). 
In particular, we want to allow any process to invoke 
clone with the CLONE_NEWNS flag to create a new names- 
pace, and allow a process to change its namespace using 
mount --bind if it’s running as the same UID that in-
voked clone(CLONE_NEWNS), along with restrictions on
setuid binaries similar to chroot. Similar support could
also be added to allow users to manage the System V IPC
namespace (CLONE_NEWIPC).

Finally, if UserFS was integrated into the Linux kernel, 
we would also like to replace our firewall mechanism 
with a per-process iptables firewall ruleset, inherited 
by child processes across fork and clone. To specify 
new firewall rules, applications would specify a new flag 
to the clone system call to start the child process with 
a fresh iptables ruleset. To ensure that a child cannot 
escape from the parent’s firewall rules, the child’s ruleset 
would be chained to the parent’s. 
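
The chaining rule can be illustrated with a small model. The following Python sketch is only an illustration of the proposal above, not an existing kernel or iptables interface: the `Ruleset` class, its `fresh_child` and `allows` methods, and the reduction of a firewall rule to a blocked destination port are all assumptions made for brevity.

```python
class Ruleset:
    """Hypothetical per-process ruleset: a set of blocked destination ports."""

    def __init__(self, blocked_ports=(), parent=None):
        self.blocked_ports = frozenset(blocked_ports)
        self.parent = parent  # link to the ruleset of the creating process

    def fresh_child(self, blocked_ports=()):
        """Start a child with a fresh ruleset that is chained to this one."""
        return Ruleset(blocked_ports, parent=self)

    def allows(self, port):
        """Traffic passes only if no ruleset on the chain blocks it."""
        ruleset = self
        while ruleset is not None:
            if port in ruleset.blocked_ports:
                return False
            ruleset = ruleset.parent
        return True


parent = Ruleset(blocked_ports={25})
child = parent.fresh_child(blocked_ports={80})
print(child.allows(25), child.allows(80), child.allows(443))
```

Because the child's rules are consulted in addition to the parent's, a child can add restrictions for itself but cannot remove a restriction that its parent is subject to.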


9 CONCLUSION 


This paper presented UserFS, the first system to provide 
egalitarian OS protection mechanisms for Linux. UserFS 
allows any user to use existing OS protection mechanisms, 
including Unix user IDs, chroot jails, and firewalls. This 
both allows applications to reduce their privileges, and in 
many cases avoids the need for root privileges altogether. 

One key idea in UserFS is representing user IDs as 
files in a /proc-like file system. This allows applications 
to manage user IDs much like they would any other file, 
without the need to introduce any new user ID manage- 
ment mechanisms. UserFS maintains a hierarchy of user 
IDs for accountability and resource revocation purposes, 
but allows child user IDs in the hierarchy to be made in- 
accessible to parent user IDs, in order to protect sensitive 
processes like ssh-agent from outside interference. To 
cope with a limited 32-bit user ID namespace, UserFS in- 
troduces per-UID generation numbers that disambiguate 
multiple instances of a reused 32-bit UID value. Finally, 
UserFS implements security checks that make it safe to 
allow non-root users to invoke chroot, without allow- 
ing users to escape out of existing chroot jails or abuse 
setuid executables. 

An important goal of the UserFS design is compati- 
bility with existing applications, interfaces, and kernel 
components. Porting applications to use UserFS requires 
only tens to hundreds of lines of code, and prevents attack- 
ers from exploiting application-level vulnerabilities, such 
as code injection or missing ACL checks in a PHP-based 
wiki web application. UserFS requires minimal changes
to the Linux kernel, consisting of a single 3,000-line
kernel module, and incurs no performance overhead for
most operations.


Abstract


Capsicum is a lightweight operating system capabil- 
ity and sandbox framework planned for inclusion in 
FreeBSD 9. Capsicum extends, rather than replaces, 
UNIX APIs, providing new kernel primitives (sandboxed 
capability mode and capabilities) and a userspace sand- 
box API. These tools support compartmentalisation of 
monolithic UNIX applications into logical applications, 
an increasingly common goal supported poorly by dis- 
cretionary and mandatory access control. We demon- 
strate our approach by adapting core FreeBSD utilities 
and Google’s Chromium web browser to use Capsicum 
primitives, and compare the complexity and robustness 
of Capsicum with other sandboxing techniques. 


1 Introduction 


Capsicum is an API that brings capabilities to UNIX. Ca- 
pabilities are unforgeable tokens of authority, and have 
long been the province of research operating systems 
such as PSOS [16] and EROS [23]. UNIX systems have 
less fine-grained access control than capability systems, 
but are very widely deployed. By adding capability prim- 
itives to standard UNIX APIs, Capsicum gives applica- 
tion authors a realistic adoption path for one of the ideals 
of OS security: least-privilege operation. We validate our 
approach through an open source prototype of Capsicum 
built on (and now planned for inclusion in) FreeBSD 9. 
Today, many popular security-critical applications 
have been decomposed into parts with different privi- 
lege requirements, in order to limit the impact of a single 
vulnerability by exposing only limited privileges to more 
risky code. Privilege separation [17], or compartmentali-
sation, is a pattern that has been adopted for applications
such as OpenSSH, Apple’s SecurityServer, and, more re- 
cently, Google’s Chromium web browser. Compartmen- 
talisation is enforced using various access control tech- 
niques, but only with significant programmer effort and 
significant technical limitations: current OS facilities are
simply not designed for this purpose. 

The access control systems in conventional (non- 
capability-oriented) operating systems are Discretionary 
Access Control (DAC) and Mandatory Access Control 
(MAC). DAC was designed to protect users from each 
other: the owner of an object (such as a file) can specify 
permissions for it, which are checked by the OS when 
the object is accessed. MAC was designed to enforce 
system policies: system administrators specify policies 
(e.g. “users cleared to Secret may not read Top Secret
documents”), which are checked via run-time hooks in-
serted into many places in the operating system’s kernel.

Neither of these systems was designed to address the 
case of a single application processing many types of in- 
formation on behalf of one user. For instance, a mod- 
ern web browser must parse HTML, scripting languages, 
images and video from many untrusted sources, but be- 
cause it acts with the full power of the user, has access to 
all his or her resources (such implicit access is known as 
ambient authority). 

In order to protect user data from malicious JavaScript, 
Flash, etc., the Chromium web browser is decomposed 
into several OS processes. Some of these processes han- 
dle content from untrusted sources, but their access to 
user data is restricted using DAC or MAC mechanisms
(the process is sandboxed).

These mechanisms vary by platform, but all require a 
significant amount of programmer effort (from hundreds 
of lines of code or policy to, in one case, 22,000 lines 
of C++) and, sometimes, elevated privilege to bootstrap 
them. Our analysis shows significant vulnerabilities in 
all of these sandbox models due to inherent flaws or in- 
correct use (see Section 5). 

Figure 1: Capsicum helps applications self-compartmentalise.

Capsicum addresses these problems by introducing
new (and complementary) security primitives to support
compartmentalisation: capability mode and capabilities.
Capsicum capabilities should not be confused with op-
erating system privileges, occasionally referred to as ca-
pabilities in the OS literature. Capsicum capabilities are
an extension of UNIX file descriptors, and reflect rights 
on specific objects, such as files or sockets. Capabilities 
may be delegated from process to process in a granular 
way in the same manner as other file descriptor types: via 
inheritance or message-passing. Operating system priv- 
ilege, on the other hand, refers to exemption from ac- 
cess control or integrity properties granted to processes 
(perhaps assigned via a role system), such as the right 
to override DAC permissions or load kernel modules. A 
fine-grained privilege policy supplements, but does not 
replace, a capability system such as Capsicum. Like- 
wise, DAC and MAC can be valuable components of a 
system security policy, but are inadequate in addressing 
the goal of application privilege separation. 

We have modified several applications, including base 
FreeBSD utilities and Chromium, to use Capsicum prim- 
itives. No special privilege is required, and code changes 
are minimal: the tcpdump utility, plagued with security 
vulnerabilities in the past, can be sandboxed with Cap- 
sicum in around ten lines of code, and Chromium can 
have OS-supported sandboxing in just 100 lines. 

In addition to being more secure and easier to use than 
other sandboxing techniques, Capsicum performs well: 
unlike pure capability systems where system calls neces- 
sarily employ message passing, Capsicum’s capability- 
aware system calls are just a few percent slower than 
their UNIX counterparts, and the gzip utility incurs a 
constant-time penalty of 2.4 ms for the security of a Cap- 
sicum sandbox (see Section 6). 


2 Capsicum design 


Capsicum is designed to blend capabilities with UNIX. 
This approach achieves many of the benefits of least- 
privilege operation, while preserving existing UNIX 
APIs and performance, and presents application authors 
with an adoption path for capability-oriented design.

Capsicum extends, rather than replaces, standard
UNIX APIs by adding kernel-level primitives (a sand- 
boxed capability mode, capabilities and others) and 
userspace support code (libcapsicum and a capability- 
aware run-time linker). Together, these extensions sup- 
port application compartmentalisation, the decomposi- 
tion of monolithic application code into components that 
will run in independent sandboxes to form logical appli- 
cations, as shown in Figure 1. 

Capsicum requires application modification to exploit 
new security functionality, but this may be done grad- 
ually, rather than requiring a wholesale conversion to a 
pure capability model. Developers can select the changes 
that maximise positive security impact while minimis- 
ing unacceptable performance costs; where Capsicum re- 
places existing sandbox technology, a performance im- 
provement may even be seen. 

This model requires a number of pragmatic design 
choices, not least the decision to eschew micro-kernel ar- 
chitecture and migration to pure message-passing. While 
applications may adopt a message-passing approach, and 
indeed will need to do so to fully utilise the Capsicum 
architecture, we provide “fast paths” in the form of di- 
rect system call manipulation of kernel objects through 
delegated file descriptors. This allows native UNIX per- 
formance for file system I/O, network access, and other 
critical operations, while leaving the door open to tech- 
niques such as message-passing system calls for cases 
where that proves desirable. 


2.1 Capability mode 


Capability mode is a process credential flag set by a new 
system call, cap_enter; once set, the flag is inherited 
by all descendent processes, and cannot be cleared. Pro- 
cesses in capability mode are denied access to global 
namespaces such as the filesystem and PID namespaces 
(see Figure 2). In addition to these namespaces, there 
are several system management interfaces that must be
protected to maintain UNIX process isolation. These in-
terfaces include /dev device nodes that allow physical
memory or PCI bus access, some ioctl operations on
sockets, and management interfaces such as reboot and
kldload, which loads kernel modules.

Access to system calls in capability mode is also re- 
stricted: some system calls requiring global namespace 
access are unavailable, while others are constrained. For 
instance, sysctl can be used to query process-local in- 
formation such as address space layout, but also to moni- 
tor a system’s network connections. We have constrained 
sysctl by explicitly marking ~30 of 3000 parameters 
as permitted in capability mode; all others are denied. 

The system calls which require constraints are 
sysctl, shm_open, which is permitted to create anony- 
mous memory objects, but not named ones, and the 
openat family of system calls. These calls already ac- 
cept a file descriptor argument as the directory to per- 
form the open, rename, etc. relative to; in capabil- 
ity mode, they are constrained so that they can only 
operate on objects “under” this descriptor. For in-
stance, if file descriptor 4 is a capability allowing ac-
cess to /lib, then openat(4, "libc.so.7") will suc-
ceed, whereas openat(4, "../etc/passwd") and
openat(4, "/etc/passwd") will not.


2.2 Capabilities 


The most critical choice in adding capability support to a 
UNIX system is the relationship between capabilities and 
file descriptors. Some systems, such as Mach/BSD, have 
maintained entirely independent notions: Mac OS X pro- 
vides each task with both indexed capabilities (ports) and 
file descriptors. Separating these concerns is logical, as
Mach ports have different semantics from file descrip- 
tors; however, confusing results can arise for application 
developers dealing with both Mach and BSD APIs, and 
we wanted to reuse existing APIs as much as possible. 
As a result, we chose to extend the file descriptor ab- 
straction, and introduce a new file descriptor type, the 
capability, to wrap and protect raw file descriptors. 

File descriptors already have some properties of ca- 
pabilities: they are unforgeable tokens of authority, and 
can be inherited by a child process or passed between 
processes that share an IPC channel. Unlike “pure” ca- 
pabilities, however, they confer very broad rights: even 
if a file descriptor is read-only, operations on meta-data 
such as fchmod are permitted. In the Capsicum model, 
we restrict these operations by wrapping the descriptor 
in a capability and permitting only authorised operations 
via the capability, as shown in Figure 3. 

The cap_new system call creates a new capability 
given an existing file descriptor and a mask of rights; 
if the original descriptor is a capability, the requested
rights must be a subset of the original rights. Capabil- 
ity rights are checked by fget, the in-kernel code for 
converting file descriptor arguments to system calls into 
in-kernel references, giving us confidence that no paths 
exist to access file descriptors without capability checks. 
Capability file descriptors, as with most others in the sys- 
tem, may be inherited across fork and exec, as well as 
passed via UNIX domain sockets. 
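
The subset rule means that rights can only be narrowed as capabilities are derived from one another. The Python sketch below models that rule only; it is not the FreeBSD implementation. The bit values assigned to the rights, the `Capability` class, the function name `cap_new_model`, the choice of `PermissionError`, and the decision to re-wrap the underlying object rather than nest capabilities are all assumptions.

```python
# Illustrative bit values; the real rights mask has roughly 60 rights.
CAP_READ = 0x1
CAP_WRITE = 0x2
CAP_SEEK = 0x4


class Capability:
    """Hypothetical wrapper around a raw descriptor object."""

    def __init__(self, wrapped, rights):
        self.wrapped = wrapped
        self.rights = rights


def cap_new_model(descriptor, rights):
    """Model of creating a capability from a descriptor and a rights mask."""
    if isinstance(descriptor, Capability):
        # Deriving from a capability: the request must be a subset of its rights.
        if rights & ~descriptor.rights:
            raise PermissionError("requested rights exceed the original capability")
        return Capability(descriptor.wrapped, rights)
    # Wrapping a raw descriptor: any mask may be chosen.
    return Capability(descriptor, rights)


raw = object()  # stands in for a raw file descriptor
read_write = cap_new_model(raw, CAP_READ | CAP_WRITE)
read_only = cap_new_model(read_write, CAP_READ)
```

In this model, deriving `read_only` from `read_write` succeeds, while asking for `CAP_READ | CAP_SEEK` starting from `read_only` is refused because `CAP_SEEK` is not held.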


There are roughly 60 possible mask rights on each 
capability, striking a balance between message-passing 
(two rights: send and receive), and MAC systems (hun- 
dreds of access control checks). We selected rights 
to align with logical methods on file descriptors: sys- 
tem calls implementing semantically identical operations 
require the same rights, and some calls may require 
multiple rights. For example, pread (read to mem- 
ory) and preadv (read to a memory vector) both re- 
quire CAP_READ in a capability’s rights mask, and read 
(read bytes using the file offset) requires CAP_READ | 
CAP_SEEK in a capability’s rights mask. 


Capabilities can wrap any type of file descriptor in-
cluding directories, which can then be passed as argu-
ments to openat and related system calls. The *at sys-
tem calls begin relative lookups for file operations with
the directory descriptor; we disallow some cases when
a capability is passed: absolute paths, paths contain-
ing “..” components, and AT_FDCWD, which requests a
lookup relative to the current working directory. With 
these constraints, directory capabilities delegate file sys- 
tem namespace subsets, as shown in Figure 4. This 
allows sandboxed processes to access multiple files in 
a directory (such as the library path) without the per- 
formance overhead or complexity of proxying each file 
open via IPC to a process with ambient authority. 
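
The three disallowed cases can be written down as a simple predicate. The Python sketch below is a model of the policy described above and not the kernel's lookup code: the function name `lookup_permitted` is hypothetical, the numeric value given to `AT_FDCWD` is an assumption used only as a sentinel, and the function assumes that any other `dir_fd` refers to a directory capability.

```python
AT_FDCWD = -100  # assumed sentinel value; only its identity matters here


def lookup_permitted(dir_fd, path):
    """Return True if a relative lookup via a directory capability may proceed."""
    if dir_fd == AT_FDCWD:
        return False  # would resolve relative to the current working directory
    if path.startswith("/"):
        return False  # absolute path ignores the directory descriptor
    # Conservative rule: reject any ".." component, wherever it appears.
    return ".." not in path.split("/")


for candidate in ("libc.so.7", "../etc/passwd", "/etc/passwd", "sub/../libc.so.7"):
    print(candidate, lookup_permitted(4, candidate))
```

Note that the check is purely syntactic: a path such as `sub/../libc.so.7` is rejected even though it would stay inside the delegated subtree, which reflects the conservative design rather than a precise containment test.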


The “..” restriction is a conservative design, and pre- 
vents a subtle problem similar to historic chroot vul- 
nerabilities. A single directory capability that only en- 
forces containment by preventing “..” lookup on the root 
of a subtree operates correctly; however, two colluding 
sandboxes (or a single sandbox with two capabilities) can 
race to actively rearrange a tree so that the check always 
succeeds, allowing escape from a delegated subset. It 
is possible to imagine less conservative solutions, such 
as preventing upward renames that could introduce ex- 
ploitable cycles during lookup, or additional synchroni- 
sation; these strike us as more risky tactics, and we have 
selected the simplest solution, at some cost to flexibility. 


Many past security extensions have composed poorly 
with UNIX security leading to vulnerabilities; thus, we 
disallow privilege elevation via fexecve using setuid 
and setgid binaries in capability mode. This restriction 
does not prevent setuid binaries from using sandboxes. 
Namespace | Description
Process ID (PID) | UNIX processes are identified by unique IDs. PIDs are returned by fork and used for signal delivery, debugging, monitoring, and status collection.
File paths | UNIX files exist in a global, hierarchical namespace, which is protected by discretionary and mandatory access control.
NFS file handles | The NFS client and server identify files and directories on the wire using a flat, global file handle namespace. They are also exposed to processes to support the lock manager daemon and optimise local file access.
File system ID | File system IDs supplement paths to mount points, and are used for forceable unmount when there is no valid path to the mount point.
Protocol addresses | Protocol families use socket addresses to name local and foreign endpoints. These exist in global namespaces, such as IPv4 addresses and ports, or the file system namespace for local domain sockets.
Sysctl MIB | The sysctl management interface uses numbered and named entries, used to get or set system information, such as process lists and tuning parameters.
System V IPC | System V IPC message queues, semaphores, and shared memory segments exist in a flat, global integer namespace.
POSIX IPC | POSIX defines similar semaphore, message queue, and shared memory APIs, with an undefined namespace: on some systems, these are mapped into the file system; on others they are simply a flat global namespace.
System clocks | UNIX systems provide multiple interfaces for querying and manipulating one or more system clocks or timers.
Jails | The management namespace for FreeBSD-based virtualised environments.
CPU sets | A global namespace for affinity policies assigned to processes and threads.


Figure 2: Global namespaces in the FreeBSD operating kernel


2.3 Run-time environment 


Even with Capsicum’s kernel primitives, creating sand-
boxes without leaking undesired resources via file de-
scriptors, memory mappings, or memory contents is dif-
ficult. libcapsicum therefore provides an API for start-
ing scrubbed sandbox processes, and explicit delega-
tion APIs to assign rights to sandboxes. libcapsicum
cuts off the sandbox’s access to global namespaces via 
cap_enter, but also closes file descriptors not positively 
identified for delegation, and flushes the address space 
via fexecve. Sandbox creation returns a UNIX domain 
socket that applications can use for inter-process com- 
munication (IPC) between host and sandbox; it can also 
be used to grant additional rights as the sandbox runs. 


3 Capsicum implementation 


3.1 Kernel changes 


Many system call and capability constraints are applied 
at the point of implementation of kernel services, rather 
than by simply filtering system calls. The advantage 
of this approach is that a single constraint, such as the 
blocking of access to the global file system namespace, 
can be implemented in one place, namei, which is responsible
for processing all path lookups. For example,
one might not have expected the fexecve call to cause 
global namespace access, since it takes a file descriptor 
as its argument rather than a path for the binary to exe- 
cute. However, the file passed by file descriptor speci- 
fies its run-time linker via a path embedded in the binary, 
which the kernel will then open and execute. 


Similarly, capability rights are checked by the ker- 
nel function fget, which converts a numeric descriptor 
into a struct file reference. We have added a new 
rights argument, allowing callers to declare what ca- 
pability rights are required to perform the current oper- 
ation. If the file descriptor is a raw UNIX descriptor, 
or wrapped by a capability with sufficient rights, the operation
succeeds. Otherwise, ENOTCAPABLE is returned.
Changing the signature of fget allows us to use the com- 
piler to detect missed code paths, providing greater assur- 
ance that all cases have been handled. 
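
The rights check described above can be sketched in a few lines of C. This is an illustration only, not the FreeBSD kernel code: the x_ prefixed names, the rights bit values, and the numeric value chosen for the error code are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical rights bits for illustration; not the FreeBSD values. */
#define X_CAP_READ   (UINT64_C(1) << 0)
#define X_CAP_WRITE  (UINT64_C(1) << 1)
#define X_CAP_SEEK   (UINT64_C(1) << 2)
#define X_CAP_FSTAT  (UINT64_C(1) << 3)

/* Placeholder standing in for the ENOTCAPABLE errno value. */
#define X_ENOTCAPABLE 93

/* Simplified descriptor-table entry: raw descriptor or capability. */
struct x_fdentry {
    int      is_capability; /* 0: raw UNIX descriptor, 1: capability */
    uint64_t rights;        /* rights mask, used only for capabilities */
};

/*
 * Return 0 if an operation needing `needed` rights may proceed on this
 * entry, or X_ENOTCAPABLE if the capability's mask does not cover it.
 */
static int
x_rights_check(const struct x_fdentry *fde, uint64_t needed)
{
    assert(fde != 0);
    if (!fde->is_capability)
        return 0;                        /* raw descriptor: no mask */
    if ((fde->rights & needed) == needed)
        return 0;                        /* every needed right is held */
    return X_ENOTCAPABLE;
}
```

Because the caller must name the rights it needs, a code path that forgets to do so fails to compile against the new signature, which is the property the text relies on.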


One less trivial global namespace to handle is the pro- 
cess ID (PID) namespace, which is used for process cre- 
ation, signalling, debugging and exit status, critical op- 
erations for a logical application. Another problem for 
logical applications is that libraries cannot create and 
manage worker processes without interfering with process
management in the application itself: unexpected SIGCHLD
signals are delivered to the application, and unexpected
process IDs are returned by wait.


Figure 4: Portions of the global filesystem namespace can be delegated to sandboxed processes.

Process descriptors address these problems in a man- 
ner similar to Mach task ports: creating a process with 
pdfork returns a file descriptor to use for process man- 
agement tasks, such as monitoring for exit via poll. 
When the process descriptor is closed, the process is ter- 
minated, providing a user experience consistent with that 
of monolithic processes: when a user hits Ctrl-C, or the 
application segfaults, all processes in the logical applica- 
tion terminate. Termination does not occur if reference 
cycles exist among processes, suggesting the need for a 
new “logical application” primitive—see Section 7. 


3.2 The Capsicum run-time environment 


Removing access to global namespaces forces fundamental
changes to the UNIX run-time environment.
Even the most basic UNIX operations for starting processes
and running programs have been eliminated:
fork and exec both rely on global namespaces. Responsibility
for launching a sandbox is shared. libcapsicum
is invoked by the application, and responsible for forking 
a new process, gathering together delegated capabilities 
from both the application and run-time linker, and di- 
rectly executing the run-time linker, passing the sandbox 
binary via a capability. ELF headers normally contain a 
hard-coded path to the run-time linker to be used with the 
binary. We execute the Capsicum-aware run-time linker 
directly, eliminating this dependency on the global file 
system namespace. 


Once rtld-elf-cap is executing in the new process,
it loads and links the binary using libraries loaded via
library directory capabilities set up by libcapsicum. The
main function of a program can call lcs_get to determine
whether it is in a sandbox, retrieve sandbox state,
query creation-time delegated capabilities, and retrieve
an IPC handle so that it can process RPCs and receive
run-time delegated capabilities. This allows a single binary
to execute both inside and outside of a sandbox,
diverging behaviour based on its execution environment.
This process is illustrated in greater detail in Figure 5.


Figure 5: Process and components involved in creating a new libcapsicum sandbox


Once in execution, the application is linked against 
normal C libraries and has access to much of the tradi- 
tional C run-time, subject to the availability of system 
calls that the run-time depends on. An IPC channel, in 
the form of a UNIX domain socket, is set up automat- 
ically by libcapsicum to carry RPCs and capabilities 
delegated after the sandbox starts. Capsicum does not 
provide or enforce the use of a specific Interface De- 
scription Language (IDL), as existing compartmentalised 
or privilege-separated applications have their own, of- 
ten hand-coded, RPC marshalling already. Here, our 
design choice differs from historic capability systems, 
which universally have selected a specific IDL, such as 
the Mach Interface Generator (MIG) on Mach. 


libcapsicum’s fdlist (file descriptor list) abstrac- 
tion allows complex, layered applications to declare ca- 
pabilities to be passed into sandboxes, in effect provid- 
ing a sandbox template mechanism. This avoids encod- 
ing specific file descriptor numbers into the ABI between 
applications and their sandbox components, a technique 
used in Chromium that we felt was likely to lead to pro- 
gramming errors. Of particular concern is hard-coding of 
file descriptor numbers for specific purposes, when those 
descriptor numbers may already have been used by other 
layers of the system. Instead, application and library
components declare process-local names bound to file
descriptor numbers before creating the sandbox; match- 
ing components in the sandbox can then query those 
names to retrieve (possibly renumbered) file descriptors. 
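
The idea of binding process-local names to descriptors can be sketched as a small lookup table. The code below is a toy model, not the libcapsicum fdlist API: every x_ prefixed name is hypothetical, and renumbering is simulated by rewriting stored integers, where a real implementation would duplicate the descriptors themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define X_FDLIST_MAX 16

/* One binding from a process-local name to a descriptor number. */
struct x_fdlist_entry {
    const char *name;
    int         fd;
};

struct x_fdlist {
    struct x_fdlist_entry e[X_FDLIST_MAX];
    size_t                n;
};

/* Host side: declare a named descriptor. Returns 0, or -1 if full. */
static int
x_fdlist_add(struct x_fdlist *l, const char *name, int fd)
{
    assert(l != NULL && name != NULL);
    if (l->n == X_FDLIST_MAX)
        return -1;
    l->e[l->n].name = name;
    l->e[l->n].fd = fd;
    l->n++;
    return 0;
}

/* Simulate sandbox creation packing descriptors densely from `base`. */
static void
x_fdlist_renumber(struct x_fdlist *l, int base)
{
    for (size_t i = 0; i < l->n; i++)
        l->e[i].fd = base + (int)i;
}

/* Sandbox side: look a descriptor up by name. Returns -1 if absent. */
static int
x_fdlist_lookup(const struct x_fdlist *l, const char *name)
{
    for (size_t i = 0; i < l->n; i++)
        if (strcmp(l->e[i].name, name) == 0)
            return l->e[i].fd;
    return -1;
}
```

A component that asks for a descriptor by name keeps working after renumbering, whereas one that hard-coded the original number would silently use the wrong descriptor.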


4 Adapting applications to use Capsicum 


Adapting applications for use with sandboxing is a non- 
trivial task, regardless of the framework, as it requires 
analysing programs to determine their resource depen- 
dencies, and adopting a distributed system programming 
style in which components must use message passing or 
explicit shared memory rather than relying on a common 
address space for communication. In Capsicum, programmers
have a choice of working directly with capability
mode or using libcapsicum to create and manage
sandboxes, and each model has its merits and costs in 
terms of development complexity, performance impact, 
and security: 


1. Modify applications to use cap_enter directly in 
order to convert an existing process with ambient 
privilege into a capability mode process inheriting 
only specific capabilities via file descriptors and vir- 
tual memory mappings. This works well for ap- 
plications with a simple structure like: open all re- 
sources, then process them in an I/O loop, such as 
programs operating in a UNIX pipeline, or interact- 
ing with the network for the purposes of a single 
connection. The performance overhead will typically
be extremely low, as changes consist of encapsulating
broad file descriptor rights into capabilities,
followed by entering capability mode. We illustrate
this approach with tcpdump.


USENIX Association


2. Use cap_enter to reinforce the sandboxes of ap- 
plications with existing privilege separation or com- 
partmentalisation. These applications have a more 
complex structure, but are already aware that some 
access limitations are in place, so have already been 
designed with file descriptor passing in mind. Re- 
fining these sandboxes can significantly improve se- 
curity in the event of a vulnerability, as we show 
for dhclient and Chromium; the performance and 
complexity impact of these changes will be low 
because the application already adopts a message 
passing approach. 


3. Modify the application to use the full 
libcapsicum API, introducing new compart- 
mentalisation or reformulating existing privilege 
separation. This offers significantly stronger 
protection, by virtue of flushing capability lists and 
residual memory from the host environment, but at 
higher development and run-time costs. Boundaries 
must be identified in the application such that not
only is security improved (i.e., code processing
risky data is isolated), but the resulting performance
is also acceptable. We illustrate this
technique using modifications to gzip. 


Compartmentalised application development is, of ne- 
cessity, distributed application development, with soft- 
ware components running in different processes and 
communicating via message passing. Distributed debug- 
ging is an active area of research, but commodity tools 
are unsatisfying and difficult to use. While we have not 
attempted to extend debuggers, such as gdb, to better 
support distributed debugging, we have modified a num- 
ber of FreeBSD tools to improve support for Capsicum 
development, and take some comfort in the generally 
synchronous nature of compartmentalised applications. 

The FreeBSD procstat command inspects kernel- 
related state of running processes, including file descrip- 
tors, virtual memory mappings, and security credentials. 
In Capsicum, these resource lists become capability lists, 
representing the rights available to the process. We have 
extended procstat to show new Capsicum-related in- 
formation, such as capability rights masks on file de- 
scriptors and a flag in process credential listings to indi- 
cate capability mode. As a result, developers can directly 
inspect the capabilities inherited or passed to sandboxes. 

When adapting existing software to run in capability 
mode, identifying capability requirements can be tricky; 
often the best technique is to discover them through 
dynamic analysis, identifying missing dependencies by
tracing real-world use. To this end, capability-related
failures return a new errno value, ENOTCAPABLE,
distinguishing them from other failures, and system calls
such as open are blocked in namei, rather than at the
system call boundary, so that paths are shown in FreeBSD’s
ktrace facility, and can be utilised in DTrace scripts.

Another common compartmentalised development 
strategy is to allow the multi-process logical application 
to be run as a single process for debugging purposes. 
libcapsicum provides an API to query whether sand- 
boxing for the current application or component is en- 
abled by policy, making it easy to enable and disable 
sandboxing for testing. As RPCs are generally
synchronous, the thread stack in the sandbox process is
logically an extension of the thread stack in the host process,
which makes the distributed debugging task less fraught 
than it otherwise might appear. 


4.1 tcpdump 


tcpdump provides an excellent example of Capsicum 
primitives offering immediate wins through straight- 
forward changes, but also the subtleties that arise when 
compartmentalising software not written with that goal 
in mind. tcpdump has a simple model: compile a pat- 
tern into a BPF filter, configure a BPF device as an in- 
put source, and loop writing captured packets rendered as 
text. This structure lends itself to sandboxing: resources 
are acquired early with ambient privilege, and later pro- 
cessing depends only on held capabilities, so can execute 
in capability mode. The two-line change shown in Fig- 
ure 6 implements this conversion. 

This significantly improves security, as historically 
fragile packet-parsing code now executes with reduced 
privilege. However, further analysis with the procstat 
tool is required to confirm that only desired capabili- 
ties are exposed. While there are few surprises, un- 
constrained access to a user’s terminal connotes signif- 
icant rights, such as access to key presses. A refinement, 
shown in Figure 7, prevents reading stdin while still al- 
lowing output. Figure 8 illustrates procstat on the re- 
sulting process, including capabilities wrapping file de- 
scriptors in order to narrow delegated rights. 

ktrace reveals another problem: libc DNS resolver
code depends on file system access, but not until after
cap_enter, leading to denied access and lost functionality,
as shown in Figure 9.

This illustrates a subtle problem with sandboxing: 
highly layered software designs often rely on on-demand 
initialisation, lowering or avoiding startup costs, and 
those initialisation points are scattered across many com- 
ponents in system and application code. This is corrected 
by switching to the lightweight resolver, which sends
DNS queries to a local daemon that performs actual resolution,
addressing both file system and network address
namespace concerns. Despite these limitations, this example
of capability mode and capability APIs shows that
even minor code changes can lead to dramatic security
improvements, especially for a critical application with a
long history of security problems.


+   if (cap_enter() < 0)
+       error("cap_enter: %s", pcap_strerror(errno));
    status = pcap_loop(pd, cnt, callback, pcap_userdata);


Figure 6: A two-line change adding capability mode to tcpdump: cap_enter is called prior to the main libpcap
(packet capture) work loop. Access to global file system, IPC, and network namespaces is restricted.


+   if (lc_limitfd(STDIN_FILENO, CAP_FSTAT) < 0)
+       error("lc_limitfd: unable to limit STDIN_FILENO");
+   if (lc_limitfd(STDOUT_FILENO, CAP_FSTAT | CAP_SEEK | CAP_WRITE) < 0)
+       error("lc_limitfd: unable to limit STDOUT_FILENO");
+   if (lc_limitfd(STDERR_FILENO, CAP_FSTAT | CAP_SEEK | CAP_WRITE) < 0)
+       error("lc_limitfd: unable to limit STDERR_FILENO");


Figure 7: Using lc_limitfd, tcpdump can further narrow rights delegated by inherited file descriptors, such as
limiting permitted operations on STDIN to fstat.


Figure 8: procstat -fC displays capabilities held by a process; FLAGS represents the file open flags, whereas
CAPABILITIES represents the capabilities rights mask. In the case of STDIN, only fstat (fs) has been granted.


Figure 9: ktrace reveals a problem: DNS resolution depends on file system and TCP/IP namespaces after cap_enter.


Figure 10: Capabilities held by dhclient before Capsicum changes: several unnecessary rights are present.


4.2 dhclient 


FreeBSD ships the OpenBSD DHCP client, which in- 
cludes privilege separation support. On BSD systems, 
the DHCP client must run with privilege to open BPF 
descriptors, create raw sockets, and configure network 
interfaces. This creates an appealing target for attackers: 
network code exposed to a complex packet format while 
running with root privilege. The DHCP client is afforded 
only weak tools to constrain operation: it starts as the 
root user, opens the resources its unprivileged compo- 
nent will require (raw socket, BPF descriptor, lease con- 
figuration file), forks a process to continue privileged ac- 
tivities (such as network configuration), and then con- 
fines the parent process using chroot and the setuid 
family of system calls. Despite hardening of the BPF 
ioctl interface to prevent reattachment to another in- 
terface or reprogramming the filter, this confinement is 
weak; chroot limits only file system access, and switching
credentials offers poor protection against weak or incorrectly
configured DAC protections on the sysctl and
PID namespaces.

Through a similar two-line change to that in tcpdump, 
we can reinforce (or, through a larger change, replace) 
existing sandboxing with capability mode. This instantly 
denies access to the previously exposed global names- 
paces, while permitting continued use of held file de- 
scriptors. As there has been no explicit flush of address 
space, memory, or file descriptors, it is important to ana- 
lyze what capabilities have been leaked into the sandbox, 
the key limitation to this approach. Figure 10 shows a 
procstat -fC analysis of the file descriptor array. 

The existing dhclient code has done an effective job
of eliminating directory access, but continues to allow the
sandbox direct rights to submit arbitrary log messages to
syslogd, modify the lease database, and use a raw socket on
which a broad variety of operations could be performed.
The last of these is of particular interest due to ioctl;
although dhclient has given up system privilege, many
network socket ioctls are defined, allowing access to
system information. These are blocked in Capsicum’s
capability mode.

It is easy to imagine extending existing privilege sep- 
aration in dhclient to use the Capsicum capability fa- 
cility to further constrain file descriptors inherited in the 
sandbox environment, for example, by limiting use of 
the IP raw socket to send and recv, disallowing ioctl.
Use of the libcapsicum API would require more significant
code changes, but as dhclient already adopts a
message passing structure to communicate with its components,
it would be relatively straightforward, offering
better protection against capability and memory leakage.
Further migration to message passing would prevent
arbitrary log messages or direct unformatted writes
to dhclient.leases.em by constraining syntax.


4.3 gzip 


The gzip command line tool presents an interesting tar- 
get for conversion for several reasons: it implements 
risky compression/decompression routines that have suf- 
fered past vulnerabilities, it contains no existing com- 
partmentalisation, and it executes with ambient user 
(rather than system) privileges. Historic UNIX sandboxing
techniques, such as chroot and ephemeral UIDs, are
a poor match because of their privilege requirement, but
also because (unlike with dhclient) there is no expectation
that a single sandbox exists: many gzip sessions
can run independently for many different users, and there 
can be no assumption that placing them in the same sand- 
box provides the desired security properties. 

The first step is to identify natural fault lines in the ap- 
plication: for example, code that requires ambient priv- 
ilege (due to opening files or building network connec- 
tions), and code that performs more risky activities, such 
as parsing data and managing buffers. In gzip, this split 
is immediately obvious: the main run loop of the ap- 
plication processes command line arguments, identifies 
streams and objects to perform processing on and send 
results to, and then feeds them to compress routines that 
accept input and output file descriptors. This suggests a 
partitioning in which pairs of descriptors are submitted to 
a sandbox for processing after the ambient privilege pro- 
cess opens them and performs initial header handling. 

We modified gzip to use libcapsicum, intercept- 
ing three core functions and optionally proxying them 
using RPCs to a sandbox based on policy queried from 
libcapsicum, as shown in Figure 11. Each RPC passes 
two capabilities, for input and output, to the sandbox, as 
well as miscellaneous fields such as returned size, orig- 
inal filename, and modification time. By limiting capa- 
bility rights to a combination of CAP_READ, CAP_WRITE, 
and CAP_SEEK, a tightly constrained sandbox is created, 
preventing access to any other files in the file system, or 
other globally named resources, in the event a vulnera- 
bility in compression code is exploited. 
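
Passing a descriptor to another process over a UNIX domain socket, the mechanism underlying the capability-carrying RPCs described above, can be sketched with the standard SCM_RIGHTS control message. This is not the gzip or libcapsicum code: x_send_fd and x_recv_fd are hypothetical helper names, and the sketch sends one descriptor with a one-byte payload and does no rights limiting.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union x_fd_ctl {
    struct cmsghdr hdr;
    char           buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor `fd` over UNIX domain socket `sock`. 0 on success. */
static int
x_send_fd(int sock, int fd)
{
    char byte = 'F';    /* at least one byte of ordinary data is sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union x_fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`. Returns the new fd, or -1. */
static int
x_recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union x_fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver obtains its own descriptor number for the same open file, so bulk data never crosses the socket; only the reference does. This is why forwarding descriptors costs the same regardless of file size.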

These changes add 409 lines (about 16%) to the size of 
the gzip source code, largely to marshal and un-marshal 
RPCs. In adapting gzip, we were initially surprised to 
see a performance improvement; investigation of this unlikely
result revealed that we had failed to propagate the
compression level (a global variable) into the sandbox,
leading to incorrect algorithm selection. This serves
as a reminder that code not originally written for decomposition
requires careful analysis. Oversights such as this
one are not caught by the compiler: the variable was correctly
defined in both processes, but never propagated.


Function        RPC                     Description
gz_compress     PROXIED_GZ_COMPRESS     zlib-based compression
gz_uncompress   PROXIED_GZ_UNCOMPRESS   zlib-based decompression
unbzip2         PROXIED_UNBZIP2         bzip2-based decompression


Figure 11: Three gzip functions are proxied via RPC to the sandbox
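
One way to guard against this class of oversight is to make every piece of state the sandboxed routine depends on an explicit field of the RPC request, so that nothing is read from a global on the far side. The sketch below is an assumption-laden illustration, not the marshalling code added to gzip: the structure, field choice, wire layout, and all x_ prefixed names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request header for a proxied compression call. */
struct x_gz_request {
    uint32_t level;  /* compression level, passed explicitly */
    uint32_t mtime;  /* modification time of the original file */
};

#define X_GZ_REQUEST_WIRE_SIZE 8u

/* Store a 32-bit value in big-endian order. */
static void
x_put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Load a 32-bit big-endian value. */
static uint32_t
x_get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Host side: encode the request. Returns bytes written, 0 if too small. */
static size_t
x_gz_request_marshal(const struct x_gz_request *req,
    unsigned char *buf, size_t buflen)
{
    assert(req != NULL && buf != NULL);
    if (buflen < X_GZ_REQUEST_WIRE_SIZE)
        return 0;
    x_put_u32(buf, req->level);
    x_put_u32(buf + 4, req->mtime);
    return X_GZ_REQUEST_WIRE_SIZE;
}

/* Sandbox side: decode the request. Returns 0, or -1 on a bad length. */
static int
x_gz_request_unmarshal(struct x_gz_request *req,
    const unsigned char *buf, size_t len)
{
    assert(req != NULL && buf != NULL);
    if (len != X_GZ_REQUEST_WIRE_SIZE)
        return -1;
    req->level = x_get_u32(buf);
    req->mtime = x_get_u32(buf + 4);
    return 0;
}
```

With this shape, the sandboxed routine takes its level from the decoded request rather than from a global, so a forgotten field shows up as a missing structure member instead of a silently defaulted value.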


Compartmentalisation of gzip raises an important design
question when working with capability mode: the
changes were small but non-trivial, so is there a better
way to apply sandboxing to applications most frequently
used in pipelines? Seaborn has suggested one possibility:
a Principle of Least Authority Shell (PLASH),
in which the shell runs with ambient privilege and
pipeline components are placed in sandboxes by the
shell [21]. We have begun to explore this approach on
Capsicum, but observe that the design tension exists here
as well: gzip’s non-pipeline mode performs a number of
application-specific operations requiring ambient privilege,
and logic like this may be equally (if not more)
awkward if placed in the shell. On the other hand, when
operating purely in a pipeline, the PLASH approach offers
the possibility of near-zero application modification.


Another area we are exploring is library self- 
compartmentalisation. With this approach, library code 
sandboxes portions of itself transparently to the host ap- 
plication. This approach motivated a number of our de- 
sign choices, especially as relates to the process model: 
masking SIGCHLD delivery to the parent when using pro- 
cess descriptors allows libraries to avoid disturbing ap- 
plication state. This approach would allow video codec 
libraries to sandbox portions of themselves while exe- 
cuting in an unmodified web browser. However, library 
APIs are often not crafted for sandbox-friendliness: one
reason we placed separation in gzip rather than libz is
that gzip provided internal APIs based on file descriptors,
whereas libz provided APIs based on buffers. Forwarding
capabilities offers full UNIX I/O performance,
whereas the cost of performing RPCs to transfer buffers
between processes scales with file size. Likewise, historic
vulnerabilities in libjpeg have largely centred on
callbacks to applications rather than existing in isolation
in the library; such callback interfaces require significant
changes to run in an RPC environment.


4.4 Chromium


Google’s Chromium web browser uses a multi-process 
architecture similar to a Capsicum logical application to 
improve robustness [18]. In this model, each tab is associated
with a renderer process that performs the risky
and complex task of rendering page contents through 
page parsing, image rendering, and JavaScript execution. 
More recent work on Chromium has integrated sandbox- 
ing techniques to improve resilience to malicious attacks 
rather than occasional instability; this has been done in 
various ways on different supported operating systems, 
as we will discuss in detail in Section 5. 

The FreeBSD port of Chromium did not include sand- 
boxing, and the sandboxing facilities provided as part of 
the similar Linux and Mac OS X ports bear little resem- 
blance to Capsicum. However, the existing compartmen- 
talisation meant that several critical tasks had already 
been performed: 


• Chromium assumes that processes can be converted
into sandboxes that limit new object access


• Certain services were already forwarded to renderers,
such as font loading via passed file descriptors


• Shared memory is used to transfer output between
renderers and the web browser


• Chromium contains RPC marshalling and passing
code in all the required places


The only significant Capsicum change to the FreeBSD 
port of Chromium was to switch from System V shared 
memory (permitted in Linux sandboxes) to the POSIX 
shared memory code used in the Mac OS X port 
(capability-oriented and permitted in Capsicum’s capability
mode). Approximately 100 additional lines of code
were required to introduce calls to lc_limitfd to limit
access to file descriptors inherited by and passed to sandbox
processes, such as Chromium data pak files, stdio,
/dev/random, and font files, and to call cap_enter.
This compares favourably with the 4.3 million lines of
code in the Chromium source tree, but would not have
been possible without existing sandbox support in the design.
We believe it should be possible, without a significantly
larger number of lines of code, to explore using
the libcapsicum API directly.


Operating system   Model      Line count   Description
Windows            ACLs       22,350       Windows ACLs and SIDs
Linux              chroot     605          setuid root helper sandboxes renderer
Mac OS X           Seatbelt   560          Path-based MAC sandbox
Linux              SELinux    200          Restricted sandbox type enforcement domain
Linux              seccomp    11,301       seccomp and userspace syscall wrapper
FreeBSD            Capsicum   100          Capsicum sandboxing using cap_enter


Figure 12: Sandboxing mechanisms employed by Chromium. 


5 Comparison of sandboxing technologies 


We now compare Capsicum to existing sandbox mecha- 
nisms. Chromium provides an ideal context for this com- 
parison, as it employs six sandboxing technologies (see 
Figure 12). Of these, two are DAC-based, two MAC-based,
and two capability-based.


5.1 Windows ACLs and SIDs 


On Windows, Chromium uses DAC to create sand- 
boxes [18]. The unsuitability of inter-user protections for 
the intra-user context is demonstrated well: the model 
is both incomplete and unwieldy. Chromium uses Ac- 
cess Control Lists (ACLs) and Security Identifiers (SIDs) 
to sandbox renderers on Windows. Chromium creates a 
modified, reduced privilege, SID, which does not appear 
in the ACL of any object in the system, in effect running 
the renderer as an anonymous user. 

However, objects which do not support ACLs are not 
protected by the sandbox. In some cases, additional pre- 
cautions can be used, such as an alternate, invisible desk- 
top to protect the user’s GUI environment. However, un- 
protected objects include FAT filesystems on USB sticks 
and TCP/IP sockets: a sandbox cannot read user files di- 
rectly, but it may be able to communicate with any server 
on the Internet or use a configured VPN! USB sticks 
present a significant concern, as they are frequently used 
for file sharing, backup, and protection from malware. 

Many legitimate system calls are also denied to the 
sandboxed process. These calls are forwarded by the 
sandbox to a trusted process responsible for filtering and 
serving them. This forwarding comprises most of the 
22,000 lines of code in the Windows sandbox module. 


5.2 Linux chroot 


Chromium’s suid sandbox on Linux also attempts to 
create a privilege-free sandbox using legacy OS access 
control; the result is similarly porous, with the additional 
risk that OS privilege is required to create a sandbox. 

In this model, access to the filesystem is limited to a
directory via chroot: the directory becomes the sandbox’s
virtual root directory. Access to other namespaces,
including System V shared memory (where the user’s
X window server can be contacted) and network access,
is unconstrained, and great care must be taken to avoid
leaking resources when entering the sandbox.
Furthermore, initiating chroot requires a setuid bi- 
nary: a program that runs with full system privilege. 
While comparable to Capsicum’s capability mode in 
terms of intent, this model suffers significant sandboxing 
weakness (for example, permitting full access to the Sys- 
tem V shared memory as well as all operations on passed 
file descriptors), and comes at the cost of an additional 
setuid-root binary that runs with system privilege. 


5.3 Mac OS X Seatbelt


On Mac OS X, Chromium uses a MAC-based framework 
for creating sandboxes. This allows Chromium to create 
a stronger sandbox than is possible with DAC, but the 
rights that are granted to render processes are still very 
broad, and security policy must be specified separately 
from the code that relies on it. 

The Mac OS X Seatbelt sandbox system allows pro- 
cesses to be constrained according to a LISP-based pol- 
icy language [1]. It uses the MAC Framework [27] to 
check application activities; Chromium uses three poli- 
cies for different components, allowing access to filesys- 
tem elements such as font directories while restricting 
access to the global namespace. 

As with other techniques, resources are acquired before
constraints are imposed, so care must be taken to avoid
leaking resources into the sandbox. Fine-grained filesystem
constraints are possible, but other namespaces, such
as POSIX shared memory, are an all-or-nothing affair.
The Seatbelt-based sandbox model is less verbose than 
other approaches, but like all MAC systems, security pol- 
icy must be expressed separately from code. This can 
lead to inconsistencies and vulnerabilities. 


5.4 SELinux 


Chromium’s MAC approach on Linux uses an SELinux
Type Enforcement policy [12]. SELinux can be used
for very fine-grained rights assignment, but in practice,
broad rights are conferred because fine-grained Type Enforcement
policies are difficult to write and maintain.
The requirement that an administrator be involved in 
defining new policy and applying new types to the file 
system is a significant inflexibility: application policies 
cannot adapt dynamically, as system privilege is required 
to reformulate policy and relabel objects. 

The Fedora reference policy for Chromium creates a 
single SELinux dynamic domain, chrome_sandbox_t,
which is shared by all sandboxes, risking potential in- 
terference between sandboxes. This domain is assigned 
broad rights, such as the ability to read all files in /etc 
and access to the terminal device. These broad policies 
are easier to craft than fine-grained ones, reducing the 
impact of the dual-coding problem, but are much less ef- 
fective, allowing leakage between sandboxes and broad 
access to resources outside of the sandbox. 

In contrast, Capsicum eliminates dual-coding by com- 
bining security policy with code in the application. This 
approach has benefits and drawbacks: while bugs can’t 
arise due to potential inconsistency between policy and 
code, there is no longer an easily accessible specification 
of policy to which static analysis can be applied. This 
reinforces our belief that systems such as Type Enforce- 
ment and Capsicum are potentially complementary, serv- 
ing differing niches in system security. 


5.5 Linux seccomp 


Linux provides an optionally-compiled capability mode- 
like facility called seccomp. Processes in seccomp 
mode are denied access to all system calls except read, 
write, and exit. At face value, this seems promising,
but as OS infrastructure to support applications using
seccomp is minimal, application writers must go to
significant effort to use it. 

In order to allow other system calls, Chromium 
constructs a process in which one thread executes in 
seccomp mode, and another “trusted” thread sharing 
the same address space has normal system call access. 
Chromium rewrites glibc and other library system call 
vectors to forward system calls to the trusted thread, 
where they are filtered in order to prevent access to inap- 
propriate shared memory objects, opening files for write, 
etc. However, this default policy is, itself, quite weak, as 
read of any file system object is permitted. 
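
The filtering step performed by the trusted thread can be pictured as a small policy function. The sketch below is illustrative only: the request shape and the names `DIRECT_SYSCALLS` and `policy_allows` are hypothetical, not taken from Chromium. It models just the rule described above, in which opening a file for write is refused while a read-only open of any path is permitted.

```python
import os

# Hypothetical model of the trusted thread's filter; names and request
# format are assumptions made for illustration, not Chromium's code.
DIRECT_SYSCALLS = {"read", "write", "exit"}  # permitted in seccomp mode itself

# Mask selecting the access-mode bits of the open() flags.
ACCESS_MODE_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def policy_allows(syscall: str, flags: int = 0) -> bool:
    """Return True if this sketch's policy would carry out the call."""
    if syscall in DIRECT_SYSCALLS:
        return True  # never needs forwarding to the trusted thread
    if syscall == "open":
        # Refuse any open that requests write access. Read-only opens are
        # allowed for every path, which is why the default policy is weak.
        return (flags & ACCESS_MODE_MASK) == os.O_RDONLY
    return False  # every other call is refused in this simplified model
```

Under this model `policy_allows("open", os.O_RDONLY)` holds for any path, which mirrors the weakness noted above: nothing in the policy distinguishes sensitive files from harmless ones.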

The Chromium seccomp sandbox contains over a 
thousand lines of hand-crafted assembly to set up 
sandboxing, implement system call forwarding, and craft a 
basic security policy. Such code is a risky proposition: 
difficult to write and maintain, with any bugs likely 
leading to security vulnerabilities. The Capsicum approach 
is similar to that of seccomp, but by offering a richer set 
of services to sandboxes, as well as more granular 
delegation via capabilities, it is easier to use correctly. 


6 Performance evaluation 


Typical operating system security benchmarking is tar- 
geted at illustrating zero or near-zero overhead in the 
hopes of selling general applicability of the resulting 
technology. Our thrust is slightly different: we know 
that application authors who have already begun to adopt 
compartmentalisation are willing to accept significant 
overheads for mixed security return. Our goal is there- 
fore to accomplish comparable performance with signif- 
icantly improved security. 

We evaluate performance in two ways: first, a set 
of micro-benchmarks establishing the overhead intro- 
duced by Capsicum’s capability mode and capability 
primitives. As we are unable to measure any notice- 
able performance change in our adapted UNIX applica- 
tions (tcpdump and dhclient) due to the extremely low 
cost of entering capability mode from an existing pro- 
cess, we then turn our attention to the performance of 
our libcapsicum-enhanced gzip. 

All performance measurements have been performed 
on an 8-core Intel Xeon E5320 system running at 
1.86GHz with 4GB of RAM, running either an unmod- 
ified FreeBSD 8-STABLE distribution synchronised to 
revision 201781 (2010-01-08) from the FreeBSD Sub- 
version repository, or a synchronised 8-STABLE distri- 
bution with our capability enhancements. 


6.1 System call performance 


First, we consider system call performance through 
micro-benchmarking. Figure 13 summarises these re- 
sults for various system calls on unmodified FreeBSD, 
and related capability operations in Capsicum.  Fig- 
ure 14 contains a table of benchmark timings. All micro- 
benchmarks were run by performing the target operation 
in a tight loop over an interval of at least 10 seconds, 
repeating for 10 iterations. Differences were computed 
using Student’s t-test at 95% confidence. 
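
The measurement procedure just described can be sketched as follows. This is a minimal illustration rather than the harness used for the paper: the function names, the batching of calls, and the use of a pooled-variance t statistic with a hard-coded critical value (2.101, the two-sided 95% value for 18 degrees of freedom, matching two groups of 10 samples) are all assumptions.

```python
import math
import statistics
import time


def time_per_op(op, interval_s: float, batch: int = 1000) -> float:
    """Call op in a tight loop for at least interval_s seconds.

    Returns the mean wall-clock time per call in seconds. Calls are made in
    batches so that reading the clock does not dominate the measurement.
    """
    calls = 0
    start = time.perf_counter()
    while True:
        for _ in range(batch):
            op()
        calls += batch
        elapsed = time.perf_counter() - start
        if elapsed >= interval_s:
            return elapsed / calls


def sample(op, iterations: int = 10, interval_s: float = 10.0) -> list:
    """Repeat the timed loop, giving one mean per iteration."""
    return [time_per_op(op, interval_s) for _ in range(iterations)]


T_CRIT_95_DF18 = 2.101  # two-sided Student's t, 95% confidence, 18 d.o.f.


def significantly_different(a, b, t_crit: float = T_CRIT_95_DF18) -> bool:
    """Pooled-variance two-sample t-test; t_crit must match len(a)+len(b)-2."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = math.sqrt(pooled * (1 / na + 1 / nb))
    diff = statistics.mean(a) - statistics.mean(b)
    if std_err == 0:
        return diff != 0
    return abs(diff / std_err) > t_crit
```

Two operations would be compared by passing the result of `sample` for each to `significantly_different`; a different number of iterations requires a different critical value.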

Our first concern is with the performance of capability 
creation, as compared to raw object creation and the 
closest UNIX operation, dup. We observe moderate, but 
expected, performance overheads for capability wrapping 
of existing file descriptors: the cap_new syscall is 
50.7% ± 0.08% slower than dup, or 539 ± 0.8 ns slower 
in absolute terms. 

Next, we consider the overhead of capability 
“unwrapping”, which occurs on every descriptor operation. 
We compare the cost of some simple operations on raw 
file descriptors to the same operations on a 
capability-wrapped version of the same file descriptor: 
writing a single byte to /dev/null; reading a single byte 
from /dev/zero; reading 10000 bytes from /dev/zero; and 
performing an fstat call on a shared memory file 
descriptor. In all cases we observe a small overhead of 
about 0.06 µs when operating on the capability-wrapped 
file descriptor. This has the largest relative performance 
impact on fstat: since it performs no I/O and simply 
inspects descriptor state, it should experience the 
highest relative overhead of any system call that requires 
unwrapping. Even in this case the overhead is relatively 
low: 10.2% ± 0.5%. 

Figure 13: Capsicum system call performance compared to standard UNIX calls. 


6.2 Sandbox creation 


Capsicum supports two ways to create a sandbox: directly 
invoking cap_enter to convert an existing process into a 
sandbox, inheriting all current capability lists and 
memory contents, and the libcapsicum sandbox API, which 
creates a new process with a flushed capability list. 

cap_enter performs similarly to chroot, used by 
many existing compartmentalised applications to restrict 
file system access. However, cap_enter out-performs 
setuid as it does not need to modify resource limits. 
As most sandboxes chroot and set the UID, entering a 
capability mode sandbox is roughly twice as fast as a tra- 
ditional UNIX sandbox. This suggests that the overhead 
of adding capability mode support to an application with 
existing compartmentalisation will be negligible, and re- 
placing existing sandboxing with cap_enter may even 
marginally improve performance. 
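
The "roughly twice as fast" figure can be checked against the Figure 14 micro-benchmark timings. The short calculation below assumes, as a simplification, that a traditional sandbox costs one chroot call plus one setuid call and nothing else.

```python
# Mean times in microseconds, copied from the Figure 14 micro-benchmarks.
CAP_ENTER_US = 1.220
CHROOT_US = 1.214
SETUID_US = 1.390

# Simplifying assumption: a traditional UNIX sandbox is chroot then setuid.
traditional_us = CHROOT_US + SETUID_US
speedup = traditional_us / CAP_ENTER_US

print(f"traditional: {traditional_us:.3f} us, "
      f"cap_enter: {CAP_ENTER_US:.3f} us, ratio: {speedup:.2f}x")
```

The ratio comes out slightly above two, consistent with the claim in the text.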

Creating a new sandbox process and replacing its 
address space using execve is an expensive operation. 
Micro-benchmarks indicate that the cost of fork is three 
orders of magnitude greater than manipulating the 
process credential, and adding execve or even a single 
instance of message passing increases that cost further. 
We also found that additional dynamically linked 
library dependencies (libcapsicum and its dependency 
on libsbuf) impose an additional 9% cost to the fork 
syscall, presumably due to the additional virtual 
memory mappings being copied to the child process. This 
overhead is not present on vfork, which we plan to use 
in libcapsicum in the future. Creating, exchanging an 
RPC with, and destroying a single sandbox (the 
“sandbox” label in Figure 13(b)) has a cost of about 1.5ms, 
significantly higher than its subset components. 


6.3 gzip performance 


While the performance cost of cap_enter is negli- 
gible compared to other activity, the cost of multi- 
process sandbox creation (already taken by dhclient 
and Chromium due to existing sandboxing) is significant. 

To measure the cost of process sandbox creation, we 
timed gzip compressing files of various sizes. Since the 
additional overheads of sandbox creation are purely at 
startup, we expect to see a constant-time overhead to the 
capability-enhanced version of gzip, with identical 
linear scaling of compression performance with input file 
size. Files were pre-generated on a memory disk by 
reading a constant-entropy data source: /dev/zero for 
perfectly compressible data, /dev/random for perfectly 
incompressible data, and base64-encoded /dev/random 
for a moderately high-entropy data source, with about 24% 
compression after gzipping. Using a data source with 
approximately constant entropy per bit minimises variation 
in overall gzip performance due to changes in 
compressor performance as files of different sizes are sampled. 
The list of files was piped to xargs -n 1 gzip -c 
> /dev/null, which sequentially invokes a new gzip 
compression process with a single file argument, and 
discards the compressed output. Sufficiently many input 
files were generated to provide at least 10 seconds of 
repeated gzip invocations, and the overall run-time 
measured. I/O overhead was minimised by staging files on 
a memory disk. The use of xargs to repeatedly invoke 
gzip provides a tight loop that minimises the time 
between xargs’ successive vfork and exec calls of gzip. 
Each measurement was repeated 5 times and averaged. 

Benchmark         Time/operation        Difference          % difference

dup               1.061 ± 0.000 µs      -                   -
cap_new           1.600 ± 0.001 µs      0.539 ± 0.001 µs    50.7% ± 0.08%
shmfd             2.385 ± 0.000 µs      -                   -
cap_new_shmfd     4.159 ± 0.007 µs      1.77 ± 0.004 µs     74.4% ± 0.181%
fstat_shmfd       0.532 ± 0.001 µs      -                   -
fstat_cap_shmfd                         0.054 ± 0.003 µs    10.2% ± 0.506%
read_1            0.640 ± 0.000 µs      -                   -
read_10000        1.534 ± 0.000 µs      -                   -
cap_read_10000    1.601 ± 0.003 µs      0.067 ± 0.002 µs    4.40% ± 0.139%
write             0.576 ± 0.000 µs      -                   -
cap_write         0.634 ± 0.002 µs      0.058 ± 0.001 µs    10.0% ± 0.241%
cap_enter         1.220 ± 0.000 µs      -                   -
getuid            0.353 ± 0.001 µs      -0.867 ± 0.001 µs   -71.0% ± 0.067%
chroot            1.214 ± 0.000 µs      -0.006 ± 0.000 µs   -0.458% ± 0.023%
setuid            1.390 ± 0.001 µs      0.170 ± 0.001 µs    14.0% ± 0.054%
fork              268.934 ± 0.319 µs    -                   -
vfork             44.548 ± 0.067 µs     -224.3 ± 0.217 µs   -83.4% ± 0.081%
pdfork            259.359 ± 0.118 µs    -9.58 ± 0.324 µs    -3.56% ± 0.120%
pingpong          309.387 ± 1.588 µs    40.5 ± 1.08 µs      15.0% ± 0.400%
fork_exec         811.993 ± 2.849 µs    -                   -
vfork_exec        585.830 ± 1.635 µs    -226.2 ± 2.183 µs   -27.9% ± 0.269%
pdfork_exec       862.823 ± 0.554 µs    50.8 ± 2.831 µs     6.26% ± 0.348%
sandbox           1509.258 ± 3.016 µs   697.3 ± 2.78 µs     85.9% ± 0.339%

Figure 14: Micro-benchmark results for various system calls and functions, grouped by category. 
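
The three kinds of input described above can be approximated in a few lines. This is a sketch under stated assumptions, not the procedure used for the paper: `os.urandom` stands in for /dev/random, Python's `zlib` (DEFLATE, the algorithm gzip uses) stands in for the gzip binary, and the helper names are hypothetical.

```python
import base64
import os
import zlib


def make_inputs(n: int) -> dict:
    """Build three n-byte buffers analogous to the three data sources."""
    raw = os.urandom(n)
    return {
        "zero": bytes(n),                     # like /dev/zero
        "random": raw,                        # like /dev/random
        "base64": base64.b64encode(raw)[:n],  # base64-encoded random data
    }


def compressed_fraction(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 6)) / len(data)


# Compare the three sources at a single size of 1 MiB.
ratios = {name: compressed_fraction(buf)
          for name, buf in make_inputs(1 << 20).items()}
for name, ratio in ratios.items():
    print(f"{name:7s} compresses to {ratio:.1%} of its original size")
```

Because base64 carries 6 bits of entropy in each 8-bit character, the base64 case should shrink by roughly a quarter, in line with the figure of about 24% quoted above, while the zero and random cases sit at the two extremes.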

Benchmarking gzip shows high initial overhead 
when compressing single-byte files. It also shows that 
wrapping file descriptors in capabilities and delegating 
them, rather than using pure message passing, leads to 
asymptotically identical behaviour as file size increases 
and run-time costs become dominated by the compression 
workload, which is unaffected by Capsicum. 
We find that the overhead of launching a sandboxed gzip 
is 2.37 ± 0.01 ms, independent of the type of 
compression stream. For many workloads, this one-off 
performance cost is negligible, or can be amortised by passing 
multiple files to the same gzip invocation. 
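
The amortisation argument can be made concrete with a small model. The sketch below assumes the sandbox adds a purely constant start-up cost per gzip invocation (the 2.37 ms measured above) and that `base_ms`, a hypothetical unsandboxed run time per file, is known; the function name is an assumption.

```python
SANDBOX_OVERHEAD_MS = 2.37  # launch overhead of a sandboxed gzip, from the text


def relative_overhead(base_ms: float, files_per_invocation: int = 1) -> float:
    """Fractional slowdown per file under a constant start-up cost model.

    base_ms is the unsandboxed compression time for one file; the fixed
    sandbox cost is shared equally by all files passed to one invocation.
    """
    per_file_overhead_ms = SANDBOX_OVERHEAD_MS / files_per_invocation
    return per_file_overhead_ms / base_ms


# A file that takes as long to compress as the sandbox takes to start
# doubles in cost, but batching ten such files cuts the penalty to 10%.
single = relative_overhead(2.37)
batched = relative_overhead(2.37, files_per_invocation=10)
```

The same function shows why large inputs converge: as `base_ms` grows, the relative overhead tends to zero regardless of batching.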


7 Future work 


Capsicum provides an effective platform for capability 
work on UNIX platforms. However, further research and 
development are required to bring this project to fruition. 

We believe further refinement of the Capsicum 
primitives would be useful. Performance could be improved 
for sandbox creation, perhaps employing a 
Capsicum-centric version of the S-thread primitive proposed by 
Bittau. Further, a “logical application” OS construct might 
improve termination properties. 

Figure 15: Run time per gzip invocation against random 
data, with varying file sizes; performance of the two 
versions comes within 57% of one another at around 512K. 

Another area for research is in integrating user in- 
terfaces and OS security; Shapiro has proposed that 
capability-centered window systems are a natural ex- 
tension to capability operating systems. Improving the 
mapping of application security constructs into OS sand- 
boxes would also significantly improve the security of 
Chromium, which currently does not consistently assign 
web security domains to sandboxes. It is in the con- 
text of windowing systems that we have found capability 
delegation most valuable: by driving delegation with UI 
behaviors, such as Powerboxes (file dialogues running 
with ambient authority) and drag-and-drop, Capsicum 
can support gesture-based access control research. 

Finally, it is clear that the single largest problem 
with Capsicum and other privilege separation approaches 
is programmability: converting local development into 
de facto distributed development adds significant com- 
plexity to code authoring, debugging, and maintenance. 
Likewise, aligning security separation with application 
separation is a key challenge: how does the programmer 
identify and implement compartmentalisations that offer 
real security benefits, and determine that they’ve done 
so correctly? Further research in these areas is critical 
if systems such as Capsicum are to be used to mitigate 
security vulnerabilities through process-based compart- 
mentalisation on a large scale. 


8 Related work 


In 1975, Saltzer and Schroeder documented a vocabulary 
for operating system security based on on-going work 
on MULTICS [19]. They described the concepts of ca- 
pabilities and access control lists, and observed that in 
practice, systems combine the two approaches in order 
to offer a blend of control and performance. Thirty-five 
years of research have explored these and other security 
concepts, but the themes remain topical. 


8.1 Discretionary and Mandatory Access 
Control 


The principle of discretionary access control (DAC) is 
that users control protections on objects they own. While 
DAC remains relevant in multi-user server environments, 
the advent of personal computers and mobile phones has 
revealed its weakness: on a single-user computer, all 
eggs are in one basket. Section 5.1 demonstrates the dif- 
ficulty of using DAC for malicious code containment. 
Mandatory access control (MAC) systemically enforces 
policies representing the interests of system implementers 
and administrators. Information flow policies tag 
subjects and objects in the system with confidentiality 
and integrity labels; fixed rules prevent reads or writes 
that would allow information leakage. Multi-Level 
Security (MLS), formalised as Bell-LaPadula (BLP), protects 
confidential information from unauthorised release [3]. 
MLS’s logical dual, the Biba integrity policy, imple- 
ments a similar scheme protecting integrity, and can be 
used to protect Trusted Computing Bases (TCBs) [4]. 
MAC policies are robust against the problem of con- 
fused deputies, authorised individuals or processes who 
can be tricked into revealing confidential information. In 
practice, however, these policies are highly inflexible, re- 
quiring administrative intervention to change, which pre- 
cludes browsers creating isolated and ephemeral sand- 
boxes “on demand” for each web site that is visited. 
Type Enforcement (TE) in LOCK [20] and, later, 
SELinux [12] and SEBSD [25], offers greater flexibil- 
ity by allowing arbitrary labels to be assigned to sub- 
jects (domains) and objects (types), and a set of rules 
to control their interactions. As demonstrated in Sec- 
tion 5.4, requiring administrative intervention and the 
lack of a facility for ephemeral sandboxes limits appli- 
cability for applications such as Chromium: policy, by 
design, cannot be modified by users or software authors. 
Extreme granularity of control is under-exploited, and 
perhaps even discourages highly granular protection: for 
example, the Chromium SELinux policy conflates 
different sandboxes, allowing undesirable interference. 
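
The fixed rules of the information flow policies mentioned in this section can be stated compactly. The sketch below uses the standard textbook formulation of Bell-LaPadula and Biba over a single ordered set of levels; the `Level` names and function names are hypothetical, and real MLS systems also carry compartments, which are omitted here.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical totally ordered sensitivity (or integrity) levels."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Bell-LaPadula protects confidentiality.
def blp_can_read(subject: Level, obj: Level) -> bool:
    return subject >= obj  # no read up


def blp_can_write(subject: Level, obj: Level) -> bool:
    return subject <= obj  # no write down (the *-property)


# Biba is the dual and protects integrity.
def biba_can_read(subject: Level, obj: Level) -> bool:
    return subject <= obj  # no read down


def biba_can_write(subject: Level, obj: Level) -> bool:
    return subject >= obj  # no write up
```

Nothing in these rules can be changed by the subject itself, which is the inflexibility the text describes: creating a fresh, isolated label per web site would require administrative action.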


8.2 Capability systems, micro-kernels, and 
compartmentalisation 


The development of capability systems has been tied to 
mandatory access control since conception, as capabil- 
ities were considered the primitive of choice for media- 
tion in trusted systems. Neumann et al’s Provably Secure 
Operating System (PSOS) [16], and successor LOCK, 
propose a tight integration of the two models, with the 
later refinement that MAC allows revocation of capabili- 
ties in order to enforce the *-property [20]. 

Despite experimental hardware such as Wilkes’ CAP 
computer [28], the eventual dominance of 
general-purpose virtual memory as the nearest approximation 
of hardware capabilities led to exploration of 
object-capability systems and micro-kernel design. Systems 
such as Mach [2], and later L4 [11], epitomise this 
approach, exploring successively greater extraction of 
historic kernel components into separate tasks. Trusted 
operating system research built on this trend through 
projects blending mandatory access control with 
micro-kernels, such as Trusted Mach [6], DTMach [22] and 
FLASK [24]. Micro-kernels have, however, been largely 
rejected by commodity OS vendors in favour of 
higher-performance monolithic kernels. 

MAC has spread, without the benefits of 
micro-kernel-enforced reference monitors, to commodity UNIX 
systems in the form of SELinux [12]. Operating system 
capabilities, another key security element of micro-kernel 
systems, have not seen wide deployment; however, 
research has continued in the form of EROS [23] (now 
CapROS), inspired by KEYKOS [9]. 

OpenSSH privilege separation [17] and Privman [10] 
rekindled interest in micro-kernel-like compartmentali- 
sation projects, such as the Chromium web browser [18] 
and Capsicum’s logical applications. In fact, large ap- 
plication suites compare formidably with the size and 
complexity of monolithic kernels: the FreeBSD kernel is 
composed of 3.8 million lines of C, whereas Chromium 
and WebKit come to a total of 4.1 million lines of 
C++. How best to decompose monolithic applications re- 
mains an open research question; Bittau’s Wedge offers a 
promising avenue of research in automated identification 
of software boundaries through dynamic analysis [5]. 

Seaborn and Hand have explored application com- 
partmentalisation on UNIX through capability-centric 
Plash [21], and Xen [15], respectively. Plash offers an 
intriguing blend of UNIX semantics with capability se- 
curity by providing POSIX APIs over capabilities, but 
is forced to rely on the same weak UNIX primitives 
analysed in Section 5. Supporting Plash on stronger 
Capsicum foundations would offer greater application 
compatibility to Capsicum users. Hand’s approach suf- 
fers from similar issues to seccomp, in that the run- 
time environment for sandboxes is functionality-poor. 
Garfinkel’s Ostia [7] also considers a delegation-centric 
approach, but focuses on providing sandboxing as an ex- 
tension, rather than a core OS facility. 

A final branch of capability-centric research is 
capability programming languages. Java and the JVM have 
offered a vision of capability-oriented programming: a 
language run-time in which references and byte code ver- 
ification don’t just provide implementation hiding, but 
also allow application structure to be mapped directly to 
protection policies [8]. More specific capability-oriented 
efforts are E [13], the foundation for Capdesk and the 
DARPA Browser [26], and Caja, a capability subset of 
the JavaScript language [14]. 


9 Conclusion 


We have described Capsicum, a practical capabilities ex- 
tension to the POSIX API, and a prototype based on 
FreeBSD, planned for inclusion in FreeBSD 9.0. Our 
goal has been to address the needs of application au- 
thors who are already experimenting with sandboxing, 
but find themselves building on sand when it comes to 
effective containment techniques. We have discussed 
our design choices, contrasting approaches from research 
capability systems, as well as commodity access con- 
trol and sandboxing technologies, but ultimately leading 
to a new approach. Capsicum lends itself to adoption 
by blending immediate security improvements to cur- 
rent applications with the long-term prospects of a more 
capability-oriented future. We illustrate this through 
adaptations of widely-used applications, from the sim- 
ple gzip to Google’s highly-complex Chromium web 
browser, showing how firm OS foundations make the job 
of application writers easier. Finally, security and perfor- 
mance analyses show that improved security is not with- 
out cost, but that the point we have selected on a spec- 
trum of possible designs improves on the state of the art. 


11 Availability 


A technical report with additional details is forthcoming. 


Abstract 


In a bid to limit the harm caused by ubiquitous remotely 
exploitable software vulnerabilities, the computer sys- 
tems security community has proposed primitives to al- 
low execution of application code with reduced privilege. 
In this paper, we identify and address the vital and largely 
unexamined problem of how to structure implementa- 
tions of cryptographic protocols to protect sensitive data 
despite exploits. As evidence that this problem is poorly 
understood, we first identify two attacks that lead to 
disclosure of sensitive data in two published state-of- 
the-art designs for exploit-resistant cryptographic proto- 
col implementations: privilege-separated OpenSSH, and 
the HiStar/DStar DIFC-based SSL web server. We then 
describe how to structure protocol implementations on 
UNIX- and DIFC-based systems to defend against these 
two attacks and protect sensitive information from dis- 
closure. We demonstrate the practicality and generality 
of this approach by applying it to protect sensitive data 
in the implementations of both the server and client sides 
of OpenSSH and of the OpenSSL library. 


1 Introduction 


Cryptographic protocols are entrusted to preserve the in- 
tegrity and secrecy of sensitive data as it traverses a net- 
work. While these protocols incorporate strong mecha- 
nisms to defend against in-network eavesdropping and 
modification of data in transit, such protocols function 
in today’s distributed systems only as imperfect, human- 
written software. Clearly, the desired outcome for secure 
system designers implementing a secure data transfer 
protocol like SSH [13] or SSL/TLS [4] is end-to-end 
integrity and secrecy for sensitive data, despite not only in- 
network threats, but also threats that may arise from the 
behavior of the protocol implementation(s) at the ends of 
the wire. The dismal past two decades of remotely ex- 
ploitable vulnerabilities in software deployed widely on 
network-attached hosts are thus real cause for alarm— 
even if the abstract design of a cryptographic protocol is 
correct, the protocol’s very implementation is a worry- 
ingly weak link in achieving end-to-end security goals. 
In the quest for a lasting end-to-end defense for sen- 
sitive data against disclosure or corruption by a remote 
attacker, whatever vulnerabilities and exploits come to 
light in the future, the systems research community has 
in recent years sought to put the venerable principle of 
least privilege [10] into better practice in the software 
running on network-connected servers. This design tenet 
dictates that the programmer should partition his code 
into compartments, each of which executes a portion of 
the program with minimal privilege necessary to carry 
out its function. Here, privilege corresponds to access 
rights for system resources: to read or write the filesys- 
tem, memory, or network, to invoke a system call, &c. 
In the context of exploitable vulnerabilities and sensitive 
information, least privilege amounts to designing an ap- 
plication with the expectation that exploits will occur, but 
limiting the harm that they may cause by restricting the 
actions that an attacker may take post-exploit. 


Early work [5, 9] explored how to minimize priv- 
ilege on compartments instantiated as standard UNIX 
processes. More recently, the community has devoted 
considerable effort to providing various operating system 
primitives intended to make it easier for programmers to 
adhere to the principle of least privilege. These primitives 
range from operating system support for decentralized 
information flow control (DIFC) [6, 12, 14, 15], which 
limits the privileges of any compartment exposed to sen- 
sitive information, to process-like primitives that lessen 
the likelihood of accidental propagation of privileges be- 
tween compartments against the programmer’s intent [2]. 


While these results all represent important advances 
over the prior state of the art, we believe that proposals 
to date for new primitives to encourage programmers’ 
adherence to least privilege largely ignore a central, vi- 
tal question: how should a programmer structure code 
and limit privilege to prevent disclosure or corruption of 
sensitive data by an attacker who can exploit a vulner- 
ability? Regardless of the primitives used, this daunting 
question looms. To their credit, the proposers of these 
primitives present examples of how to structure applica- 
tion code to use them. But these examples are typically 
offered as existential evidence that the primitives them- 
selves are useful; no guidance or principles are offered 
for how one may structure an application’s code to use 
the primitives and robustly provide the desired end-to- 
end secrecy and/or integrity guarantees. 


Moreover, the structures of these example applica- 
tions are complex, as they are typically split into many 
compartments. To wit, the OKWS web server spreads 
its code among at least 5 compartments (processes) [5], 
the sthread-partitioned Apache/SSL web server consists 
of 9 compartments (sthreads and callgates) [2], and the 
HiStar/DStar-labeled Apache/SSL web server consists 
of 7 compartments (processes) [15]. Each application’s 
many compartments are configured with different privi- 
leges and labels, respectively, and interconnected in com- 
plex patterns. Structuring code to use these primitives ap- 
pears difficult. Indeed, as we show in Section 3, even 
highly security-conscious programmers using state-of- 
the-art techniques [9, 15] have not adequately considered 
how to defend cryptographic protocol implementations 
from exploit-based attacks. 

In this paper, we offer a practical improvement over 
the status quo: principles to guide programmers in struc- 
turing cryptographic protocol implementations so as to 
robustly protect sensitive user data end-to-end, including 
in cases where a remote attacker exploits untrusted ap- 
plication code. Our contributions include: 

• We define two general classes of attack on 
cryptographic protocol implementations: session key 
disclosure attacks and oracle attacks. We demonstrate 
that two state-of-the-art cryptographic protocol 
implementations, one in privilege-separated OpenSSH [9] 
and the other in a DIFC-labeled Apache/SSL web 
server [15], are vulnerable to these attacks. 

• We provide protocol-agnostic principles for 
structuring cryptographic protocol implementations to protect 
sensitive data against disclosure and corruption when 
an exploitable vulnerability is present in code that 
processes network input. 

• As evidence of the practicality and generality of these 
principles, we present restructured implementations of 
the OpenSSH server and client and of the OpenSSL 
library that limit privilege so as to protect users’ 
sensitive data from an adversary who can remotely 
exploit the implementation. This restructured OpenSSL 
library can act as a drop-in replacement for the stock 
library, bringing robustness against these attacks to a 
wide range of SSL-enabled applications. 


2 Background 


We now summarize the state of the art in protecting sen- 
sitive data in network server software. The two main ap- 
proaches in use are privilege separation and decentral- 
ized information flow control (DIFC). 


2.1 Privilege Separation with Processes 


In a monolithic application, in which all code executes 
in a single compartment (under UNIX or Linux, a pro- 
cess), all instructions execute with full privilege. Thus, 
an exploit of a vulnerability may result in disclosure of 
sensitive data, and more generally, grants the full privi- 
lege held by the application to any code injected by the 
attacker. Privilege separation [9] has proven effective in 
mitigating these threats. This technique follows from the 
observation that an application need not execute individ- 
ual operations with the union of all privileges needed 
by all operations during the application’s entire lifetime. 
Many vulnerability-prone operations, such as parsing, do 
not require access to sensitive information or the filesys- 
tem. If we partition a monolithic application into com- 
partments and restrict some compartments’ privileges, an 
exploit in an unprivileged compartment will not be able 
to disclose or corrupt sensitive information to which it 
does not have access. Code that runs in privileged com- 
partments, however, must be carefully audited to protect 
the sensitive data it can access. 

The privilege-separated OpenSSH server [9] divides 
the server’s code into separate standard UNIX/Linux 
processes. This partitioning includes a network-facing 
unprivileged process that performs key exchange and au- 
thentication protocols, and a privileged monitor process 
running as root that exports an interface to the unpriv- 
ileged process to allow invocation of privileged opera- 
tions, such as signing with the server’s private key, veri- 
fying user credentials, &c. 
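
The essence of the monitor is a narrow, explicit interface in front of the secret. The sketch below illustrates that pattern only and is not OpenSSH's code: it runs in a single process, whereas the real monitor is a separate process reached over a socket, the class and operation names are hypothetical, and HMAC-SHA256 is a placeholder for signing with the server's private key.

```python
import hashlib
import hmac


class Monitor:
    """Stand-in for the privileged monitor: holds the key, exports few ops."""

    def __init__(self, private_key: bytes):
        self._private_key = private_key
        # Only operations listed here can be invoked by the unprivileged side.
        self._handlers = {"sign": self._sign}

    def _sign(self, payload: bytes) -> bytes:
        # Placeholder for a public-key signature; the key never leaves here.
        return hmac.new(self._private_key, payload, hashlib.sha256).digest()

    def handle(self, op: str, payload: bytes) -> bytes:
        """Dispatch a request from the unprivileged process."""
        handler = self._handlers.get(op)
        if handler is None:
            raise PermissionError(f"operation not exported: {op}")
        return handler(payload)


monitor = Monitor(private_key=b"hypothetical-host-key")
signature = monitor.handle("sign", b"key-exchange-hash")
```

A request for anything outside the exported set, such as a hypothetical "read_key" operation, is refused. The security of the design therefore rests on how carefully the exported operations themselves are chosen.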

This structure is intended to deny the attacker execu- 
tion of code with root privilege on the server; the at- 
tacker only interacts directly with the unprivileged pro- 
cess. Provos et al. state that “programming errors occur- 
ring in the unprivileged parts can no longer be abused to 
gain unauthorized privileges” [9]. This claim holds be- 
cause the unprivileged process executes with restricted 
file system access (enforced with a chroot system 
call), and with unused user and group IDs of nobody, 
which prevent it from tampering with other processes. 

The SELinux security extensions to Linux [7], which 
post-date Provos et al.’s work, allow enforcement of flex- 
ible mandatory access control policies specified by a sys- 
tem administrator. These policies support finer-grained 
restriction of a process’s privileges than under stock 
Linux, primarily by checking system call invocations in 
the kernel against a per-process access control list. We 
employ these extensions in our cryptographic protocol 
implementations for OpenSSH and OpenSSL. 


2.2 DIFC 


Decentralized information flow control (DIFC), as 
implemented in the research prototype operating systems 
Asbestos [12] and HiStar [14], and retrofitted to Linux in 
Flume [6], offers a different approach to limiting 
privilege within applications. In these systems, a programmer 
expresses an information flow policy by labeling data 
according to its sensitivity level. Should an unprivileged 
compartment access data labeled as sensitive, it becomes 
tainted, and at run-time, the operating system prevents 
it from communicating with compartments tainted with 
lower levels of sensitivity, or with the network or 
console. This way, an unprivileged compartment cannot 
convey sensitive data out of the application. To allow output, 
trusted compartments perform privileged operations on 
sensitive data: they own sensitive labels, and are thus 
allowed by the operating system to declassify sensitive 
information, stripping it of its sensitivity label(s). 

Figure 1: HiStar-labeled SSL web server. We omit SSLd’s and netd’s 
labels in the interest of brevity. 
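
The taint-and-declassify behaviour described above can be modelled in a few lines. This is a deliberately simplified sketch, not the label model of Asbestos, HiStar, or Flume: labels are plain strings, data is a `(value, labels)` pair, and the class and method names are hypothetical.

```python
class FlowError(Exception):
    """Raised when a tainted compartment tries to release data."""


class Compartment:
    def __init__(self, name: str, owned=()):
        self.name = name
        self.taint = set()        # labels picked up by reading data
        self.owned = set(owned)   # labels this compartment may declassify

    def read(self, data):
        """Reading labelled data taints the compartment with its labels."""
        value, labels = data
        self.taint |= set(labels)
        return value

    def declassify(self, data):
        """Strip only the labels this compartment owns."""
        value, labels = data
        return (value, frozenset(labels) - self.owned)

    def send_to_network(self, value):
        """Output is allowed only if no unowned taint remains."""
        blocking = self.taint - self.owned
        if blocking:
            raise FlowError(f"{self.name} is tainted with {sorted(blocking)}")
        return value


secret = ("user record", frozenset({"user"}))
untrusted = Compartment("parser")
trusted = Compartment("declassifier", owned={"user"})
```

After `untrusted.read(secret)`, any call to `untrusted.send_to_network` raises `FlowError`, while `trusted` can read the same data and still send, because it owns the label.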

Building on these DIFC primitives, Zeldovich et 
al. present a state-of-the-art privilege-separated SSL web 
server [15], shown in slightly simplified form in Figure 1. 
Ovals represent code: shaded ovals are trusted, privileged 
compartments, while white ovals are untrusted compart- 
ments. A dashed arrow between compartments A and B 
indicates that A may invoke an operation in B with argu- 
ments and retrieve the result. Boxes represent sensitive 
data. A solid arrow from data to a compartment denotes 
that the compartment may read that data; an arrow in the 
reverse direction denotes write access. Circles annotating 
data items and compartments indicate labels; in the latter 
case, a compartment is tainted with the label in question. 
Finally, a label within a star denotes that a compartment 
owns that label (and may declassify data labeled with it). 

The HiStar-labeled SSL web server is partitioned into 
several untrusted compartments to limit the effect of 
a compromise of any single one. The major compart- 
ments are per-connection SSLd, per-connection httpd, 
and shared RSAd daemons. SSLd handles a client’s SSL 
connection and performs key exchange, server authenti- 
cation, encryption and decryption. httpd processes clear- 
text HTTP requests; it uses SSLd to decrypt requests and 
encrypt replies. httpd can obtain ownership of a user’s 
label by authenticating with the trusted authd daemon. 
Label ownership allows httpd to read the user’s data and 
declassify it for transfer over the network. The trusted 
netd serves as a barrier between the application and the 
network. It passes only declassified data (with no label) 
to the network. 


3 Attacks on Protocol Implementations 


The designers of cryptographic protocols like SSH and 
SSL aim to provide end-to-end confidentiality and in- 
tegrity for users’ data transferred during a session. When 
applied correctly, both privilege separation and DIFC can 
ensure that exploits of unprivileged compartments in a 
protocol’s implementation will not lead to violations of 
these properties. In this section, we present two attacks 
that violate the confidentiality and integrity of sensitive 
user data in two state-of-the-art privilege-separated sys- 
tems: one in privilege-separated OpenSSH, and one in a 
HiStar-labeled Apache-derived SSL web server.¹ 

Figure 2: Session key disclosure attack against privilege-separated 
OpenSSH server. 


3.1 Session Key Disclosure Attack 


The partitioning goal stated by the designers of privilege- 
separated OpenSSH was to prevent attackers from executing 
code with root privilege. However, as we will see, that 
goal is not sufficient to preserve the confidentiality and 
integrity of the user’s sensitive data. 

In prior work [2], we described an active man-in- 
the-middle attack against an SSL-enabled Apache Web 
server. This attack, which we term the session key disclo- 
sure attack (SKD attack), is also valid against a privilege- 
separated OpenSSH server. While in prior work we only 
discussed this attack against an SSL implementation, we 
now demonstrate that this attack applies against any pro- 
tocol in which the two parties share a symmetric secret 
key. 

In the SKD attack, an active man in the middle com- 
promises an unprivileged compartment on the server, dis- 
closes the user’s session key, and can then decrypt the 
sensitive data transmitted during the session. This attack 
succeeds because the unprivileged compartment respon- 
sible for key exchange and server authentication can read 
the session key shared between the server and client. We 
illustrate the SKD attack on Diffie-Hellman (DH) key ex- 
change in OpenSSH in Figure 2. Here an unprivileged 
compartment processes key exchange messages and in- 
vokes a privileged monitor to sign a session ID with 
the server’s private key (the privileged monitor is not 
shown in the figure). The user-privileged compartment 
executes with the authenticated user’s UID and provides 
a remotely accessible shell. 

The attacker begins by exploiting the server’s unprivi- 
leged compartment. He relays all key exchange messages 
to and from a legitimate user. The server and user com- 
pute a shared session key, which the attacker’s injected 
code sends to the attacker from the compromised compart- 
ment. After user authentication, the user transmits sen- 
sitive data encrypted with the compromised session key. 
Using the session key, the attacker can reveal the user’s 
sensitive data, as well as inject his own commands and 
obtain further sensitive information stored on the server. 
Moreover, the session key also provides secrecy for user 
authentication, so the password of a client using pass- 
word authentication will be compromised. 

The state-of-the-art, HiStar-labeled SSL web 
server [15] aims to safeguard users’ sensitive data 
from disclosure to other users. We note with interest 
that because the designers of this cryptographic protocol 
implementation did not consider the SKD attack when 
structuring their code, this server is vulnerable to the 
SKD attack in the same way that the privilege-separated 
OpenSSH server is. Specifically, the untrusted SSLd 
compartment computes a session key for a user’s 
connection, but if an active man-in-the-middle attacker 
compromises this compartment, she may disclose the 
session key. 


3.2 Oracle Attack 


Next, consider the HiStar-labeled SSL web server shown 
in Figure 1. Depending on the key exchange protocol in 
use, RSAd signs either the ephemeral RSA key or the 
public DH components supplied by the untrusted SSLd 
with the server’s permanent private key. This signature 
authenticates the server to the client. It is possible, how- 
ever, to abuse the signing operation exported by RSAd. 
Although a compromised SSLd cannot directly read the 
private key, it can obtain a signature on any data chosen 
by the attacker: the attacker controls the SSLd compart- 
ment, and can invoke RSAd with any arguments she 
chooses. Thus, the attacker can use a compromised SSLd 
to produce valid signatures using the server’s identity. 
This example demon- 
strates that simply putting sensitive data beyond direct 
reach of untrusted code does not provide sufficient isola- 
tion. 

We name such attacks against a cryptographic proto- 
col’s partitioning oracle attacks. Any trusted compart- 
ment or sequence of trusted compartments isolating sen- 
sitive data and exporting privileged operations to un- 
trusted code can be an oracle. An oracle takes untrusted 
input from untrusted code and returns the result of a priv- 
ileged operation. An attacker can obtain sensitive infor- 
mation by invoking the trusted compartment with ap- 
propriately chosen inputs. SSLd is meant only to pass 
RSAd an ephemeral key or the DH components for its 
own current session for signing. But if an active man- 
in-the-middle attacker compromises SSLd, she can sign 
arbitrary keys and DH components and present them to 
other users, and so impersonate the server. 
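
The signing oracle can be illustrated with a minimal sketch. It assumes an HMAC as a stand-in for the server's RSA signature (the Python standard library has no RSA), so `verify` here reuses the secret, unlike real public-key verification; `rsad_sign`, `verify`, and the key value are hypothetical names for illustration only.

```python
import hashlib
import hmac

SERVER_PRIVATE_KEY = b"server-long-term-secret"   # stand-in for the RSA private key


def rsad_sign(data: bytes) -> bytes:
    """Trusted compartment: signs whatever the caller supplies (the oracle)."""
    return hmac.new(SERVER_PRIVATE_KEY, data, hashlib.sha256).digest()


def verify(data: bytes, signature: bytes) -> bool:
    """Stand-in for a client checking the server's signature."""
    return hmac.compare_digest(rsad_sign(data), signature)


# A compromised SSLd never reads SERVER_PRIVATE_KEY, yet it obtains a valid
# signature over key material chosen entirely by the attacker.
attacker_components = b"attacker-chosen DH components"
forged = rsad_sign(attacker_components)
print(verify(attacker_components, forged))   # True
```

Isolating the key changes nothing here: the exported operation itself gives the attacker everything the key would.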

We have further identified oracle structures in the 
“baseline” privilege-separated OpenSSH server [9]. The 
trusted monitor process exposes a private-key signing op- 
eration to the unprivileged compartment for authentica- 
tion of the server during key exchange. The unprivileged 
compartment thus has an oracle for the server’s private 
key, and an attacker who compromises that compartment 
can impersonate the OpenSSH server, just as was de- 
scribed for the SSL web server above. 

While studying the SSH and SSL/TLS protocols, we 
identified further oracle attacks. Digital signatures suf- 
fer not only from signing oracles, but also verification 
oracles, in which an attacker can force successful signa- 
ture verification by supplying chosen inputs to a trusted 
compartment performing this privileged operation. There 
also exists an oracle where an attacker forces a set of 
trusted compartments generating a session key to pro- 
duce the same key used in a past user’s session; we name 
this oracle a deterministic session key oracle. Forcing 
reuse of a session key allows an attacker to replay mes- 
sages from a past session. (This particular threat exists in 
SSL’s RSA key exchange protocol.) Finally, encryption 
and decryption oracles may allow an attacker to encrypt 
arbitrary data and decrypt confidential messages. 


3.3 Discussion 


The SKD and oracle attacks are independent of the low- 
level system primitive used to limit privilege; they appear 
equally in applications built with privilege separation and 
DIFC. These attacks are made possible by weakly struc- 
tured cryptographic protocol implementations. The im- 
plementation of a cryptographic protocol should guaran- 
tee the same properties provided in the middle of the net- 
work: data confidentiality, data integrity, and robust au- 
thentication of the peers, even if untrusted compartments 
in its implementation are compromised. Avoiding SKD 
and oracle attacks requires subtle structuring of the im- 
plementation of a cryptographic protocol. 

The SKD and oracle attacks target building blocks of 
cryptographic protocols. Risk of an SKD attack exists in 
many cases where a session key and key exchange pro- 
tocol are used. Similarly, oracle attacks are associated 
with basic cryptographic operations such as encryption, 
decryption, signing, signature verification, message au- 
thentication, &c. 

We next propose guiding principles for defense against 
SKD and oracle attacks. Just as these attacks arise in 
building blocks for cryptographic protocols, these prin- 
ciples concern how to implement these building blocks 
safely. We thus believe both the attacks and defenses ap- 
ply to many cryptographic protocols.² 


4 Principles for Partitioning 


In this section, we define principles to guide the pro- 
grammer when partitioning an implementation of a cryp- 
tographic protocol into reduced-privilege compartments. 
These principles allow preserving the key end-to-end se- 
curity properties of the protocol, even when untrusted 
compartments are compromised. Our principles are ag- 
nostic to the underlying privilege-enforcement mecha- 
nism. Thus, they may be applied in DIFC-based systems, 
in privilege-separated systems based on Linux processes, 
and in other systems. They apply both to the client and 
server sides of cryptographic protocols. 

Throughout, we assume that an attacker can compro- 
mise untrusted code and execute arbitrary code in its 
compartment, though only with the privileges allowed in 
that compartment. In this threat model, if an untrusted 
compartment acquires sensitive information or an at- 
tacker compromises a privileged compartment, we pre- 
sume she obtains sensitive information. 


4.1 Two-Barrier, Three-Stage Partitioning 


A cryptographic protocol typically shares a symmetric 
secret key between two communicating parties, used to 
compute message authentication codes (MACs) and to 
encrypt data. A key exchange protocol confidentially 
shares this symmetric key. In addition, in some applica- 
tions, the cryptographic protocol must authenticate peers 
to each other. Any authentication method that does not 
rely on transferring sensitive data, such as public key 
authentication, may be performed during the key ex- 
change protocol, before a session-key-encrypted chan- 
nel has been established. The SSL/TLS protocol fits this 
model [4]. In contrast, password-based authentication, 
e.g., as supported by SSH [13], sends sensitive data over 
the network, and must therefore only authenticate after 
the session-key-MACed and -encrypted channel has been 
established. After authentication, an application is as- 
sured of the remote principal’s identity, and can grant the 
remote principal access to locally stored sensitive data. 

We distinguish two attack models. The first is that of 
the SKD attack described in Section 3.1, where a man- 
in-the-middle attacker exploits a vulnerability in a client 
or server application to obtain the peers’ session key. The 
second attack model is that of an impersonation attack, 
where an attacker exploits an endpoint and subverts au- 
thentication in order to impersonate one of the peers. 

In order to prevent these attacks, a partitioned applica- 
tion should implement structures that we term a session 
key barrier and a user privilege barrier. These divide 
an application into three stages, as shown in Figure 3. 
The first such stage, the session key negotiation stage, 
performs the key exchange protocol. The second stage, 
the pre-authenticated stage, conducts peer authentica- 
tion. Finally, the post-authenticated stage processes user 
requests. Within each stage, one untrusted compartment 
handles network input and executes without privileges to 
read or write sensitive data, while multiple trusted com- 
partments execute with privilege to access sensitive data. 
These trusted compartments export any necessary privi- 
leged operations to the untrusted compartment. 

Figure 3: Barriers and stages in protocol partitioning. 
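
As a rough sketch, the three stages and two barriers can be read as a small state machine in which each barrier is the only way to move forward. The event names and the `advance` helper below are hypothetical, introduced only for illustration.

```python
from enum import Enum


class Stage(Enum):
    SESSION_KEY_NEGOTIATION = 1
    PRE_AUTHENTICATED = 2
    POST_AUTHENTICATED = 3


# The only legal transitions, each gated by one barrier (event names are hypothetical).
TRANSITIONS = {
    (Stage.SESSION_KEY_NEGOTIATION, "session_key_established"): Stage.PRE_AUTHENTICATED,
    (Stage.PRE_AUTHENTICATED, "peer_authenticated"): Stage.POST_AUTHENTICATED,
}


def advance(stage: Stage, event: str) -> Stage:
    """Cross a barrier, or fail if the event does not match the current stage."""
    try:
        return TRANSITIONS[(stage, event)]
    except KeyError:
        raise RuntimeError(f"{event!r} is not allowed in stage {stage.name}") from None


stage = Stage.SESSION_KEY_NEGOTIATION
stage = advance(stage, "session_key_established")   # session key barrier
stage = advance(stage, "peer_authenticated")        # user privilege barrier
print(stage.name)                                   # POST_AUTHENTICATED
```

The table has no backward edges and no edge that skips a barrier, which is the property the partitioning relies on.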


Session Key Barrier The session key barrier denotes 
the killing of the untrusted compartment that completes 
session key negotiation and the subsequent spawning of a 
new untrusted compartment (in Linux, a process) to con- 
tinue execution in the pre-authenticated stage. We now 
explain why this structure is necessary. 

The untrusted compartment performing session key 
negotiation (before the session key barrier) is the only 
untrusted compartment in the partitioning of the crypto- 
graphic protocol implementation that processes cleartext, 
unauthenticated messages from the network. These mes- 
sages (and exploits!) may arrive from an SKD attacker. 
Thus, while the untrusted compartment in the session key 
negotiation stage interacts with the remote peer to com- 
pute the session key, it should not have read access to 
the session key. In addition, any data that allows deriving 
the session key, such as a private Diffie-Hellman compo- 
nent (in the case of Diffie-Hellman key exchange) or a 
pre-master secret (in the case of RSA-based session key 
establishment in SSL) should also be considered sensi- 
tive. All access to privileged operations with such data 
should be provided via trusted compartments. 

Because this compartment only processes messages in 
cleartext, it does not in fact need read access to the ses- 
sion key; only the next stage, the pre-authenticated stage, 
which continues execution after the channel between the 
two peers is MAC’ed and encrypted with the session key, 
needs the session key. 


Principle 1: A network-facing compartment perform- 
ing session key negotiation should not have access to 
a session key, nor any data that allows deriving the 
session key. 
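
A minimal sketch of Principle 1 for Diffie-Hellman key exchange follows. It assumes toy group parameters that are far too small for real use, and it models compartments as Python objects, which enforce no isolation; in a real partitioning `KeyMonitor` would run in a separate process. All names are hypothetical.

```python
import hashlib
import secrets

# Toy Diffie-Hellman parameters (assumption: illustration only, not secure).
P = 2**127 - 1   # a Mersenne prime
G = 3


class KeyMonitor:
    """Trusted compartment: the private DH component and session key stay here."""

    def __init__(self):
        self._private = secrets.randbelow(P - 3) + 2   # uniform in [2, P-2]
        self._session_key = None

    def public_component(self) -> int:
        return pow(G, self._private, P)

    def finish(self, peer_public: int) -> None:
        # Validate the untrusted argument before using it.
        if not 1 < peer_public < P - 1:
            raise ValueError("peer public component out of range")
        shared = pow(peer_public, self._private, P)
        self._session_key = hashlib.sha256(shared.to_bytes(16, "big")).digest()
        # Nothing is returned: the caller never sees the key or the shared secret.


def negotiate(monitor: KeyMonitor, peer_public: int) -> int:
    """Untrusted, network-facing code: it handles only public values."""
    our_public = monitor.public_component()
    monitor.finish(peer_public)
    return our_public


server, client = KeyMonitor(), KeyMonitor()
server_public = negotiate(server, client.public_component())
client.finish(server_public)
```

After the exchange both monitors hold the same key, while `negotiate` has touched only the two public components.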





Because the untrusted compartment performing ses- 
sion key negotiation may be exploited, we cannot trust 
the provenance of the code executing in that compart- 
ment at the end of session key negotiation, and rather 
than allowing that compartment to continue execution in 
the pre-authenticated stage, where it would have access 
to the session key, we kill it (i.e., kill the Linux process). 

But why can’t an SKD attacker exploit the untrusted 
compartment in the pre-authenticated stage? This com- 
partment only processes input that is MAC’ed using the 
now available session key. A would-be SKD attacker 
cannot inject messages with a valid MAC into the chan- 
nel, and so is precluded from exploiting this compart- 
ment. We assume here that the MAC computation func- 
tion itself, which processes network input, can be audited 
and trusted not to be exploited. 

Thus, both the MAC on the channel and the killing of 
the untrusted compartment in which session key negoti- 
ation has completed effectively erect a barrier between 
any SKD attacker and the session key. 
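
The MAC half of this barrier can be sketched as a gate that runs before any other parsing of network input. The sketch assumes HMAC-SHA-256 with the tag appended to the payload; sequence numbers and encryption are omitted, and `seal` and `open_or_reject` are hypothetical names.

```python
import hashlib
import hmac

TAG_LEN = 32   # SHA-256 output size in bytes


def seal(session_key: bytes, payload: bytes) -> bytes:
    """Append a MAC tag to an outgoing payload."""
    tag = hmac.new(session_key, payload, hashlib.sha256).digest()
    return payload + tag


def open_or_reject(session_key: bytes, message: bytes):
    """Return the payload only if its MAC verifies; otherwise return None."""
    if len(message) < TAG_LEN:
        return None
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(session_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    return payload


key = b"k" * 32
print(open_or_reject(key, seal(key, b"login alice")))     # b'login alice'
print(open_or_reject(key, b"exploit" + b"\x00" * 32))     # None
```

Only `open_or_reject` sees unauthenticated bytes, which is why the text asks that the MAC function itself be small enough to audit.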


Principle 2: When enabling the MAC, a network- 
facing compartment performing session key negotia- 
tion should be killed, and a new one created with priv- 
ilege to access the session key. 





Principle 3: After enabling the MAC, there should be 
no unMAC’ed messages processed by the untrusted 
compartment. 





Note that the “original” privilege-separated OpenSSH 
server does in fact destroy the unprivileged compartment 
after user authentication, but we require this be done 
after key exchange. The “original” OpenSSH destroys 
the compartment not for SKD attack-resistance reasons, 
but because of a programming difficulty. In this imple- 
mentation, the unprivileged compartment runs as user ID 
nobody, but must change its user ID to that of the au- 
thenticated user. Changing a process’s user ID requires 
root privilege; therefore, the monitor kills the compart- 
ment and creates a new one with the required user ID. 

Transitioning to the pre-authenticated stage may re- 
quire transferring state from the unprivileged compart- 
ment of the session key negotiation stage to the unpriv- 
ileged compartment of the pre-authenticated stage. As 
this state comes from a compartment that may be con- 
trolled by an SKD attacker, the pre-authenticated stage 
should validate this state’s sanity to prevent an SKD 
attacker from passing bad state in an attempt to com- 
promise the pre-authenticated stage. The same problem 
arises when a privileged compartment accepts arguments 
to a privileged operation from an untrusted compartment; 
these arguments should also be verified to prevent com- 
promise of the privileged compartment. 


Principle 4: Any state exported from a compartment 
performing session key negotiation and any untrusted 
arguments passed to privileged compartments should 
be validated. 





We do not offer general techniques for verification of 
untrusted state and arguments. However, in our parti- 
tioning of protocol implementations, we employ pipes 
for inter-process communication. Although marshaling, 
unmarshaling, and data copies carry a performance cost, 
this mechanism provides a recipient with an RPC-like expec- 
tation of the format of the data it receives. These RPC- 
like semantics ease state and argument verification. 
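
To illustrate how an RPC-like format eases validation, the following sketch frames each message with a fixed header and rejects anything that does not match the one shape the recipient expects. The header layout, message type, and size bound are assumptions made for this example, not the format used in the authors' implementation.

```python
import struct

HEADER = struct.Struct("!BI")   # message type (1 byte), payload length (4 bytes)
MSG_KEX_STATE = 1               # hypothetical message type
MAX_PAYLOAD = 4096              # hypothetical upper bound on exported state


def encode(msg_type: int, payload: bytes) -> bytes:
    """Marshal one message for writing to a pipe."""
    return HEADER.pack(msg_type, len(payload)) + payload


def decode_checked(buf: bytes) -> bytes:
    """Unmarshal one message from an untrusted compartment, rejecting surprises."""
    if len(buf) < HEADER.size:
        raise ValueError("truncated header")
    msg_type, length = HEADER.unpack_from(buf)
    if msg_type != MSG_KEX_STATE:
        raise ValueError("unexpected message type")
    if length > MAX_PAYLOAD or len(buf) != HEADER.size + length:
        raise ValueError("bad payload length")
    return buf[HEADER.size:]


print(decode_checked(encode(MSG_KEX_STATE, b"negotiated-cipher=aes")))
```

Checks on the payload's contents (for example, that a negotiated cipher name is one the server supports) would follow the framing checks and remain protocol-specific.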

The session key barrier is enforced when an appli- 
cation switches permanently from communicating with 
cleartext messages to MAC’ed messages. Some proto- 
cols, such as SSL, however, can alternate between these 
two types of messages. In such cases, the transition be- 
tween the two stages should be performed after the last 
cleartext message. However, doing so would require pro- 
cessing messages MAC’ed and encrypted with the ses- 
sion key during the session key negotiation stage, which 
risks creating session key oracles! We address this prob- 
lem with Principle 7. 


Principle 5: A cryptographic protocol should not al- 
ternate between cleartext messages and MAC’ed mes- 
sages. 





User Privilege Barrier The user privilege barrier rep- 
resents any authentication method that can be used to 
authenticate a peer before granting it privilege to ac- 
cess sensitive information owned by a particular user. 
This barrier prevents impersonation attacks, where an at- 
tacker exploits an application to subvert its authentica- 
tion mechanism. Authentication should be performed by 
an unprivileged compartment that has no access to sensi- 
tive user data. The pre-authenticated stage is protected by 
the session key barrier, so this stage is not exposed to any 
SKD attacker. However, it is crucial for the integrity of 
the session key barrier that there be no unMAC’ed mes- 
sages processed during the pre-authenticated and post- 
authenticated stages. Without the SKD threat, the ses- 
sion key is no longer sensitive information in the pre- 
authenticated stage, and it can be accessed directly by 
unprivileged code. We allow the impersonator to access 
the session key at this point because it is his own key and 
does not correspond to any other user’s session. Success- 
ful authentication transitions the application into the next 
stage, the post-authenticated stage. 

Today’s state-of-the-art privilege-reduced applications 
implement the user privilege barrier as we require. How- 
ever, monolithic, full-privilege applications perform au- 
thentication in a privileged compartment. The privilege- 
separated OpenSSH server performs user authentication 
in an unprivileged compartment, and then the monitor 
creates a new compartment with the user ID and group 
ID of the authenticated user. The HiStar-labeled SSL 
web server supports only password authentication, and 
the unprivileged httpd daemon obtains ownership of the 
user’s labels only after the user successfully authenti- 
cates with an authentication daemon. 

Some protocols authenticate peers without sending 
confidential data, such as passwords. For example, the 
SSL protocol’s handshake supports only public key au- 
thentication methods. Such authentication techniques 
can be merged with the key exchange protocol or per- 
formed in cleartext after it. Thus, the user privilege bar- 
rier can be established within the session key negotia- 
tion stage omitting the pre-authenticated stage. This op- 
timization is encouraged, as it reduces the number of 
stages and compartments, and thus increases the perfor- 
mance of a privilege-separated application. 

Authentication that requires passing sensitive data en- 
crypted with the session key cannot be performed dur- 
ing the session key negotiation stage. If it were, the ses- 
sion key negotiation stage would require a trusted com- 
partment to decrypt sensitive data, and that compart- 
ment would result in a session key oracle that could 
be used to decrypt the user’s sensitive data. Moreover, 
other trusted compartments would be needed to process 
authentication-related sensitive data, because we cannot 
allow untrusted code to operate with confidential data. 

The post-authenticated stage executes in a compart- 
ment with the authenticated user’s privilege; it acts 
for the authenticated user and can access his data. 
When we transition from the pre-authenticated to post- 
authenticated stage, we need not kill the former, as it can- 
not be exploited, given the MAC’ed channel precludes 
SKD attacks and the authentication barrier prevents im- 
personation attacks. Instead, we can change the privilege 
of the compartment used in the pre-authenticated stage 
to that of the authenticated user, and continue execution 
with the code for the post-authenticated stage. 

We note that for some applications, the post- 
authenticated stage may require further privilege sep- 
aration. For example, an application may require ac- 
cess to a centralized database where sensitive data be- 
longing to many users is stored. In this case, the user- 
authenticated compartment should be denied direct ac- 
cess to the database, but a trusted compartment should 
export access to the database. This privilege separation, 
reminiscent of techniques explored in OKWS [5], pre- 
vents a user from accessing other users’ sensitive data. 


4.2 Oracle Prevention Techniques 


In the previous section, we described how to implement 
cryptographic protocols so as to thwart SKD and imper- 
sonation attacks. Throughout the suggested implementa- 
tion structure there is sensitive data accessible only by 
trusted compartments, which in turn export privileged 
operations to unprivileged compartments. As discussed 
in Section 3.2, in all such situations, there is a risk of 
granting an attacker an oracle for sensitive information. 
For example, the session key negotiation stage de- 
pends on confidential session key sharing. An SKD at- 
tacker can use a trusted compartment as a decryption or- 
acle to obtain a secret component of a session key. An im- 
personator may replay authentication data from another 
connection as an input to an authentication oracle and 
pass authentication as a legitimate user. Clearly, we need 
techniques to mitigate any oracles in these stages. 


Entangle Output Strongly with Per-Session Known- 
Random Input Network protocols employ random- 
ness generated afresh for every session to defeat authenti- 
cation replay attacks, where an attacker replays messages 
eavesdropped from a user session to reestablish the past 
session and repeat a user’s past requests. The server gen- 
erates a random nonce incorporated into the session key 
(in the case of RSA key exchange) or a fresh private DH 
component (for DH key exchange) to make the session 
key different for every session. We can similarly employ 
this session randomness as a defense to counter oracles. 

The output of a trusted compartment should not com- 
pletely depend on untrusted input, so that an attacker will 
not be able to replay past input to the compartment and 
get the same deterministic result. Entangling the output 
of a privileged compartment with a trusted per-session 
random nonce solves this problem. 

For example, Figure 4 demonstrates an approach 
to preventing a signing oracle in a privilege-separated 
OpenSSH server. We restrict the trusted monitor that im- 
plements signing with the private key to sign only ses- 
sion IDs that incorporate per-session random bits. A se- 
quence of privileged operations performed by the trusted 
compartment ensures that the server’s private DH com- 
ponent is indeed included in the session ID. This way, 
we entangle the output of the RSA signing compart- 
ment/operation with trusted, per-session, known-random 
input. Numbers within trusted compartments in Figure 4 
specify the order of their invocation, and this order 
should be enforced by the application. 

With this oracle defense mechanism, the attacker can- 
not mount an impersonation attack, as every signed 
session ID will incorporate different randomness con- 
tributed by the server, and will thus not be valid in the 
context of any other session. Similarly, in order to pre- 
vent deterministic session key oracles, we make sure that 
the compartment generating the keys includes random- 
ness generated afresh for every session. Moreover, per- 
session randomness is crucial in prevention of signature 
verification oracles; the data for signature verification 
should also incorporate it. 
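
The signing restriction can be sketched as follows. This is a loose model under stated assumptions, not the OpenSSH monitor: an HMAC stands in for the RSA signature, the DH group is a toy, the session ID is a simplified hash rather than the real SSH exchange hash, and all names are hypothetical. The point is that `step3_sign` takes no data argument, so the caller cannot choose what is signed.

```python
import hashlib
import hmac
import secrets

SERVER_PRIVATE_KEY = b"server-long-term-secret"   # stand-in for the host key
P, G = 2**127 - 1, 3                              # toy DH parameters, not secure


class Monitor:
    """Trusted compartment that signs only session IDs it derived itself."""

    def __init__(self):
        self._dh_private = None
        self._session_id = None

    def step1_dh_public(self) -> int:
        self._dh_private = secrets.randbelow(P - 3) + 2   # fresh per session
        return pow(G, self._dh_private, P)

    def step2_session_id(self, client_public: int, transcript: bytes) -> None:
        if self._dh_private is None:
            raise RuntimeError("step 1 must run first")
        if not 1 < client_public < P - 1:
            raise ValueError("client DH value out of range")
        server_public = pow(G, self._dh_private, P)
        shared = pow(client_public, self._dh_private, P)
        h = hashlib.sha256()
        for part in (transcript, client_public.to_bytes(16, "big"),
                     server_public.to_bytes(16, "big"), shared.to_bytes(16, "big")):
            h.update(len(part).to_bytes(4, "big") + part)   # length-prefix each field
        self._session_id = h.digest()

    def step3_sign(self) -> bytes:
        if self._session_id is None:
            raise RuntimeError("step 2 must run first")
        return hmac.new(SERVER_PRIVATE_KEY, self._session_id, hashlib.sha256).digest()


client_public = pow(G, 12345, P)
signatures = []
for _ in range(2):                     # same untrusted inputs, two sessions
    m = Monitor()
    m.step1_dh_public()
    m.step2_session_id(client_public, b"hello messages")
    signatures.append(m.step3_sign())
print(signatures[0] == signatures[1])  # False unless the two private values collide
```

Because the server's fresh private DH component enters every session ID, a signature extracted in one session is of no use in another.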


Principle 6: To prevent oracles, entangle output 
strongly with per-session, known-random input. 





Figure 4: Prevention of private key oracle in OpenSSH server by en- 
tangling output with per-session known-random input. 

In RSA key exchange in the SSL/TLS protocol, there 
is the potential for a deterministic session key oracle at- 
tack, where an attacker can produce a deterministic ses- 
sion key by supplying chosen inputs to a privileged com- 
partment generating the key. In particular, a session key 
is derived from two public components, the per-session 
server and client randoms, and a pre-master secret that is 
transmitted encrypted under the server’s public key [4]. 
When generat- 
ing the session key, these components are concatenated 
together and hashed. The server decrypts the pre-master 
secret using its private key before hashing it together with 
the other components. If an attacker controls the server 
random, client random, and encrypted pre-master secret 
inputs to the session key generation function, he can feed 
data eavesdropped from a user session to the privileged 
compartment generating the session key and produce the 
key that corresponds to the eavesdropped session. We 
prevent deterministic session key oracles by ensuring 
that every server-computed session key includes a trusted 
server nonce produced and supplied to the compartment 
generating the session key by a trusted source. This way, 
an attacker cannot control the generated session key, as 
each time it incorporates a different random nonce. 
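
The difference between the two designs can be sketched as below. The sketch follows the text's description of concatenating and hashing the components (real SSL/TLS uses a dedicated pseudorandom function), treats `pre_master` as the already-decrypted secret, and returns the key only so the example can compare values; in a real partitioning the key would stay inside trusted compartments. All names are hypothetical.

```python
import hashlib
import secrets


def unsafe_session_key(client_random: bytes, server_random: bytes,
                       pre_master: bytes) -> bytes:
    # All three inputs come from untrusted code, so replayed inputs give a replayed key.
    return hashlib.sha256(client_random + server_random + pre_master).digest()


class KeyGenerator:
    """Trusted compartment that supplies the server random itself."""

    def __init__(self):
        self.server_random = secrets.token_bytes(32)   # fresh per session, public

    def session_key(self, client_random: bytes, pre_master: bytes) -> bytes:
        return hashlib.sha256(client_random + self.server_random + pre_master).digest()


replayed = (b"c" * 32, b"s" * 32, b"p" * 48)           # eavesdropped from an old session
print(unsafe_session_key(*replayed) == unsafe_session_key(*replayed))   # True

first = KeyGenerator().session_key(replayed[0], replayed[2])
second = KeyGenerator().session_key(replayed[0], replayed[2])
print(first == second)   # False unless two 32-byte nonces collide
```

The server random remains public, as the protocol requires; what changes is that untrusted code can no longer choose it.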


Obfuscate Untrusted Input by Hashing The SSL 
protocol alternates cleartext change cipher spec mes- 
sages with authenticated and encrypted finished mes- 
sages [4]. A change cipher spec message signals that the 
sender is about to enable encryption and authentication 
on all subsequent messages. A finished message contains 
a MAC’ed and encrypted hash of all previous cleartext 
messages received by a peer during the handshake pro- 
tocol. The finished message ensures that these cleartext 
messages were not tampered with by an attacker. 

To ensure that the session key barrier is enforced, 
we cannot process cleartext messages in the pre- 
authenticated stage. Instead we should process the fin- 
ished messages within the session key negotiation stage. 
However, doing so requires a trusted compartment that 
performs session key encryption and decryption opera- 
tions on behalf of untrusted code. This trusted compart- 
ment is a session key encryption/decryption oracle which 
can be used to decrypt user information and validly en- 
crypt an attacker’s exploits or requests. 

Our oracle mitigation technique provides the required 
privileged operations (encryption and decryption with a 
session key) and avoids a session key oracle by obfuscat- 
ing input data through hashing. As the finished message 
is an encrypted hash, a trusted compartment can be struc- 
tured in the following way: it obtains data from an un- 
trusted compartment, hashes the data, and then encrypts 
the resulting hash. A privileged operation that hashes 
data and then encrypts is not useful for an attacker, as the 
attacker’s requests and exploits for the pre-authenticated 
and post-authenticated stages will be viewed as hashes. 

As for the decryption oracle, we do not return the 
cleartext finished message to untrusted code. Instead, our 
trusted compartment takes the verification data from an 
untrusted compartment and performs verification of the 
finished message itself. The result of this verification is 
returned to the untrusted compartment. However, this 
mechanism allows dictionary attacks, where an attacker 
can guess the cleartext message by supplying the verifi- 
cation data. Again, obfuscating the untrusted validation 
data by hashing before comparing it with the cleartext 
finished message solves this problem. This approach fits 
the protocol because the finished message happens to be 
a hash of all previous handshake messages. If an attacker 
attempts to guess the cleartext requests, his guess will be 
hashed first, then compared with the original message. 
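
Both halves of this technique can be sketched in one trusted helper. The sketch assumes a toy encrypt-then-MAC construction built from SHA-256 and HMAC as a stand-in for the negotiated cipher suite, and it simplifies the finished message to a plain hash of the handshake messages; `FinishedHelper` and its methods are hypothetical names.

```python
import hashlib
import hmac
import secrets

NONCE_LEN, DIGEST_LEN = 16, 32


def _stream(key: bytes, nonce: bytes) -> bytes:
    """Toy per-message keystream, long enough for one SHA-256 digest."""
    return hashlib.sha256(key + b"enc" + nonce).digest()


def _tag(key: bytes, nonce: bytes, body: bytes) -> bytes:
    return hmac.new(key, b"mac" + nonce + body, hashlib.sha256).digest()


class FinishedHelper:
    """Trusted compartment holding the session key during negotiation."""

    def __init__(self, session_key: bytes):
        self._key = session_key

    def make_finished(self, handshake_messages: bytes) -> bytes:
        # Only a hash of the caller's data is ever encrypted and MAC'ed.
        digest = hashlib.sha256(handshake_messages).digest()
        nonce = secrets.token_bytes(NONCE_LEN)
        body = bytes(a ^ b for a, b in zip(digest, _stream(self._key, nonce)))
        return nonce + body + _tag(self._key, nonce, body)

    def check_finished(self, blob: bytes, handshake_messages: bytes) -> bool:
        # The decrypted plaintext never leaves this compartment; the caller's
        # verification data is hashed before the comparison.
        if len(blob) != NONCE_LEN + 2 * DIGEST_LEN:
            return False
        nonce = blob[:NONCE_LEN]
        body = blob[NONCE_LEN:NONCE_LEN + DIGEST_LEN]
        tag = blob[NONCE_LEN + DIGEST_LEN:]
        if not hmac.compare_digest(tag, _tag(self._key, nonce, body)):
            return False
        received = bytes(a ^ b for a, b in zip(body, _stream(self._key, nonce)))
        expected = hashlib.sha256(handshake_messages).digest()
        return hmac.compare_digest(received, expected)


helper = FinishedHelper(secrets.token_bytes(32))
finished = helper.make_finished(b"client hello | server hello | ...")
print(helper.check_finished(finished, b"client hello | server hello | ..."))   # True
print(helper.check_finished(finished, b"GET /secret HTTP/1.0"))                # False
```

An attacker who calls `make_finished` with a chosen request obtains only the protected hash of that request, which the later stages will not parse as a command.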

The hashing that we apply to prevent both oracles is 
already present in the SSL handshake. But the hand- 
shake and our oracle mitigation technique use it for dif- 
ferent reasons. The handshake requires the compression 
and collision-resistance of a hash function, but our tech- 
nique employs the hash function because of its non- 
invertibility. Happily for us, the hash function provides 
all the mentioned properties and does double duty. 


Principle 7: To prevent oracles, obfuscate untrusted 
input by hashing. 


Last Resort: More Trusted Code The previous oracle 
mitigation techniques require the availability of a random 
nonce or a hash function. However, for those cases in 
which a cryptographic protocol does not specify these 
functions at a point in the protocol where there is the risk 
of an oracle, we offer a last resort technique. 

For an oracle to exist, a result of a privileged oper- 
ation must return to an unprivileged compartment. It is 
possible to avoid the oracle by making the output privi- 
leged and restricting access to it in the unprivileged code. 
Although this technique helps, it is not efficient, as a 
new trusted compartment is required to process the re- 
sult, and the result of the new compartment may need 
to be processed in the same way. Our last resort technique 
may lead to a chain of trusted compartments, which in- 
creases the trusted code base and requires more auditing 
work. Moreover, to terminate this chain, there must be a 
suitable condition for applying one of the previous oracle 
mitigation techniques, or the last trusted compartment in 
the chain must not produce any output. 


Principle 8: To prevent oracles, as a last resort, add 
more trusted code. 


4.3. Degrees of Sensitivity 


Cryptographic protocols often operate on sensitive data 
of more than one class. As an example, one frequently 
occurring class of sensitive data is that which must be 
kept secret to ensure secrecy and integrity of data trans- 
ferred within a single session, e.g., the pre-master secret 
in RSA key exchange, the private DH component in DH 
key exchange, the session key, the per-session ephemeral 
RSA private key, &c. Disclosure of such sensitive data 
results in violation of the secrecy and/or integrity of sen- 
sitive data within a single session. Yet there is often an- 
other class of even more sensitive data that must remain 
secret in order to preserve the secrecy of user data in 
many sessions. This class includes a server’s private key, 
users’ private keys, and passwords that are reused on 
many servers. The secrecy of such data is vital because 
an attacker can use it to gain access to user data in mul- 
tiple sessions by impersonating the server, or by using 
users’ passwords to access many servers. 

In a simple scenario like this one, involving two classes
of sensitive data (that which is critical to one session’s
secrecy vs. that which is critical to ensuring many
sessions’ secrecy), mixing sensitive data of both classes and
code to manipulate data of both classes in the same
compartment incurs unwarranted risk. To see why, let’s
deviate from our threat model and assume that an attacker
can compromise trusted compartments. Now any vulner- 
ability in code that manipulates sensitive data pertaining 
to one session’s secrecy can disclose sensitive data that 
could compromise secrecy of all sessions. Creating dis- 
tinct compartments for data of differing degrees of sen- 
sitivity (and the code that manipulates it) mitigates this 
risk. Similarly, to prevent disclosure of one user’s data to 
another, separate compartments should manage sensitive 
session-related key data for each user. 


Principle 9: A privilege-separated application should
manage a session with two separate privileged
compartments: one to operate with data related to
secrecy of the current session, and one to manage data
that preserves secrecy of many sessions.





Isolating code and data in distinct compartments ac- 
cording to their sensitivity often reduces trusted code 
base size; the quantity of code with privilege with respect
to one piece of data decreases. 


5 Hardened SSH Protocol Implementation 


We now demonstrate these principles for preventing 
SKD and oracle attacks by finely privilege-separating the 
implementations of the client and server sides of the SSH 
protocol. 

Recent privilege separation and DIFC work focuses on 
server applications, as they accept connections and can 
thus be attacked at will. But the rise of web browser ex- 
ploits demonstrates that client code is equally at risk. An 
attacker can set up a public service and provide access 
to it via SSH. By exploiting vulnerabilities in the SSH 
client implementation, the attacker can obtain users’ pri- 
vate keys, used to authenticate them to other legitimate 
SSH servers. These keys allow the attacker to obtain or 
tamper with the user’s sensitive information stored at 
these other SSH servers. Moreover, as the SKD attack is 
equally valid on both sides, server and client, protection 
against it is equally needed on the two sides. 

Throughout this paper, the baseline OpenSSH server 
design we refer to is that of Provos et al. [9]. While this 
OpenSSH server implements privilege separation, it al- 
lows unprivileged code access to the session key (contra- 
vening Principles 1 and 2) and to sign a session ID pro- 
vided by unprivileged code (contravening Principle 6), 
and thus is vulnerable to SKD and oracle attacks. We 
show how to partition the server more finely to prevent 
these attacks. But first, we focus on the OpenSSH client, 
which to date has only existed in monolithic form, and is 
thus also vulnerable to both attacks. 


5.1 Hardened OpenSSH Client 


The OpenSSH client runs under the invoking user’s user 
and group IDs. Because changing the user ID to nobody 
and invoking the chroot system call require root 
privilege, they cannot be used here. Instead, we limit 
the privilege of the trusted and untrusted compartments 
of the OpenSSH client with SELinux policies [7], and 
the SELinux type enforcement mechanism in particular. 
SELinux policies allow us to restrict untrusted processes 
from issuing unwanted system calls such as ptrace,
open, connect, &c. Our prototype supports only
password and public key authentication, and does not yet 
implement advanced SSH functionality (tunneling, X11 
forwarding, or support for authentication agents). 

Our hardened OpenSSH client starts in the ssh_t do- 
main, defined as a standard policy in the SELinux pack- 
age for the original monolithic SSH client. This policy 
provides the union of all privileges required by all code 
in the SSH client; i.e. an application in the ssh_t do- 
main may open SSH configuration files, access files in 
the /tmp directory, connect to a server using a network 
Figure 5: Architecture of privilege-separated OpenSSH client. Shaded
ovals denote privileged compartments. Unshaded ovals denote unpriv- 
ileged compartments. The last line in each oval denotes the SELinux 
policy enforced. 


Session monitor

1) DH_priv_key = gen_DH_priv_key()
2) DH_pub_key = comp_DH_pub_key(DH_priv_key)
3) sess_key = comp_sess_key(DH_priv_key, srvr_DH_pub_key)
4) sess_ID’ = comp_sess_ID(sess_key, clnt_version, srvr_version, clnt_kexinit, srvr_kexinit, ...)
5) sym_keys = derive_sym_keys(sess_ID’, sess_key)
6) srvr_pub_key’ = verify_srvr_pub_key(srvr_pub_key, known_hosts_file)
7) verify_sig(sess_ID’, srvr_pub_key’, sig)

Private key monitor

1) sig = priv_key_sign(priv_key, sess_ID’, user_name, service, auth_mode, ...)





Figure 6: Privileged operations performed by the two client monitors. 
Sensitive data appear in bold, and are accessible only by the monitor 
compartment in which they appear. Untrusted parameters provided by 
unprivileged compartments are not in bold. x’ denotes that sensitive 
data x is exported to an unprivileged compartment read-only. 


socket, create a pseudo-terminal device, &c. We use this 
domain to initialize the client application and connect to 
the requested SSH server. At this point, the client has 
not yet processed any data from the server. Before ex- 
changing any SSH protocol messages, the client creates 
two new processes (compartments): a privileged session 
monitor that performs privileged operations on sensitive 
data that can compromise only a single SSH session, 
and a private key monitor that performs authentication 
operations with the client’s private keys. This ensemble 
of three compartments (represented by ovals) appears in 
Figure 5. The use of two distinct monitors is motivated 
by Principle 9. 

The session monitor runs in the ssh_monitor_t domain, 
a domain we have defined that confines the process to 
access only the known_hosts file; to read/write UNIX 
sockets for communicating with the private key monitor 
and an unprivileged process running untrusted code (de- 
scribed below); and to read/write a terminal device. The 
session monitor cannot create or access any files apart
from known_hosts, nor may it create new sockets. The 
private key monitor runs in the ssh_pkey_t domain, a do- 
main we have defined with a similarly tight policy, al- 
lowing it only to read the user’s private key(s), with no 
access to other files, nor privilege to create any sockets. 
The private key monitor shares a UNIX socket with the 
session monitor and only accepts requests from the latter. 
After creating these two monitor processes, the original 
SSH client process drops privilege to the ssh_nobody_t
domain. Untrusted code runs in this unprivileged process 
and domain during the rest of the SSH client’s execu- 
tion. The ssh_nobody_t domain allows the unprivileged 
process to communicate with the session monitor and re- 
mote server via previously opened sockets, but prevents 
it from opening any new ones. The ssh_nobody_t domain 
further denies all access to the file system, allowing the 
unprivileged process access to the terminal device only. 
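
The three domains described above can be summarized in a small table. The following is a toy model only, written in Python rather than SELinux policy syntax; the Domain class and the file labels "known_hosts" and "private_key" are hypothetical placeholders, and the permissions simply restate the prose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """Toy model of one SELinux domain; not SELinux policy syntax."""

    name: str
    readable_files: frozenset = frozenset()
    may_create_sockets: bool = False
    may_use_terminal: bool = False

    def may_read(self, path: str) -> bool:
        return path in self.readable_files


DOMAINS = {
    # Session monitor: known_hosts, existing UNIX sockets, terminal.
    "ssh_monitor_t": Domain("ssh_monitor_t", frozenset({"known_hosts"}),
                            may_use_terminal=True),
    # Private key monitor: the user's private key(s) and nothing else.
    "ssh_pkey_t": Domain("ssh_pkey_t", frozenset({"private_key"})),
    # Untrusted code: no file system access, terminal only.
    "ssh_nobody_t": Domain("ssh_nobody_t", may_use_terminal=True),
}

for domain in DOMAINS.values():
    print(domain.name, sorted(domain.readable_files), domain.may_create_sockets)
```

None of the three domains may create new sockets; each communicates only over sockets opened before privilege was dropped.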

The session monitor compartment isolates all sensi- 
tive data that can be used to compromise the current re- 
mote login session, and performs all privileged opera- 
tions with these data, enumerated in Figure 6, that are es- 
sential for key exchange and prevention of a private-key 
oracle. When a privileged operation takes non-sensitive 
data as input, the non-sensitive input is supplied by the 
unprivileged compartment. Symmetric keys (sym_keys) 
are the keys derived from the session key for the MAC 
and encryption/decryption. The session monitor enforces 
the order in which an untrusted compartment may invoke 
its privileged operations. 
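
One way to picture order enforcement is a small state machine. This is a sketch under the assumption that the permitted order is exactly the numbering of Figure 6; the OrderedMonitor class and its invoke method are hypothetical and do not describe how the prototype is written.

```python
class OrderedMonitor:
    """Toy session monitor that only accepts operations in a fixed order."""

    # Operation names follow Figure 6; the enforcement logic is an assumption.
    ORDER = (
        "gen_DH_priv_key",
        "comp_DH_pub_key",
        "comp_sess_key",
        "comp_sess_ID",
        "derive_sym_keys",
        "verify_srvr_pub_key",
        "verify_sig",
    )

    def __init__(self) -> None:
        self._next = 0  # index of the only operation currently permitted

    def invoke(self, op: str) -> None:
        if self._next >= len(self.ORDER) or op != self.ORDER[self._next]:
            raise PermissionError(f"operation {op!r} not permitted at this point")
        self._next += 1


monitor = OrderedMonitor()
monitor.invoke("gen_DH_priv_key")
try:
    # Skipping ahead to signature verification is refused.
    monitor.invoke("verify_sig")
    skipped_ahead = True
except PermissionError:
    skipped_ahead = False
print(skipped_ahead)
```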

The private key monitor isolates the client’s private 
key and performs signing operations with the key. Only 
the session monitor may invoke these signing operations 
in the private key monitor (over a UNIX-domain socket), 
and it provides the session ID to be signed as an argu- 
ment. We give a more detailed explanation of the private 
key signing operation at the end of this section. 


Session Key Negotiation Stage We now consider the 
first stage of the hardened OpenSSH client, the session 
key negotiation (SKN) stage, designed to thwart SKD at- 
tacks (described in Section 3.1). In the SKN stage, an 
unprivileged compartment—with the help of the session 
monitor—performs Diffie-Hellman key exchange to ne- 
gotiate a session key and authenticate the server. In ac- 
cordance with Principle 1, we restrict the SKN stage 
to run in an unprivileged compartment that cannot ac- 
cess sensitive data—not the DH private key, nor the ses- 
sion key, nor the symmetric keys (as shown in Figure 6). 
Keeping the session key secret (and thus thwarting an 
SKD attack) requires in turn keeping this data secret. 
We must also prevent a verification oracle attack 
against the client at this point in the handshake. Suppose 
the attacker wants to impersonate a server to the client, 
and can trick the client into connecting to a server he
controls, instead of to the bona fide server intended by 
the client. Suppose further that the attacker exploits the 
client. To authenticate the server, the client must verify 
the server’s public key against the list of trusted public 
keys in the known_hosts file, and then validate the 
server’s signature on the session ID. Once the attacker 
exploits the client, if the exploited compartment of the 
client implementation allows invocation of the signature
verification operation with the session ID or server’s public
key provided by this compartment, then the attacker may
be able to force signature verification to succeed, and 
thus spoof the bona fide server to the client. To see why, 
note the arguments to the signature verification routine 
verify_sig() in the session monitor in Figure 6. If the at- 
tacker controls the values of the signature argument and 
either the session ID argument or the server public key 
argument, he can provide inputs that will cause the signa- 
ture to verify. That is, he can either sign a benign sess_ID 
with his own private key and supply his own correspond- 
ing srvr_pub_key, or supply a bogus sess_ID signed by 
the bona fide server (readily obtained from the attacker’s 
own connection to the bona fide server), along with the 
bona fide server’s true srvr_pub_key. 

To prevent this verification oracle, we must not al- 
low an unprivileged compartment (at risk for exploit) 
to provide either srvr_pub_key or sess_ID to verify_sig(). 
We thus perform signature verification in the session 
monitor, and isolate sess_ID and srvr_pub_key within 
the monitor. In actuality, the untrusted compartment 
provides srvr_pub_key to the session monitor, but the 
session monitor validates it against the contents of 
the known_hosts file before verifying the signature. 
Note that sess_ID is entangled with trusted random bits 
generated by the client for every new session, originating
from the client’s DH_priv_key via comp_sess_key()
and comp_sess_ID(). This construction, specified by the 
OpenSSH protocol, implicitly applies Principle 6, which 
further prevents an attacker from forcing sess_ID to 
match that from a past eavesdropped session. 
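
The two spoofing routes and their mitigation can be sketched as follows. This is an illustration under loud assumptions: textbook RSA over small Mersenne primes stands in for the server's signature scheme (far too weak for real use), a Python set stands in for the known_hosts file, and the SessionMonitor class and helper names are hypothetical.

```python
import hashlib

E = 65537  # public exponent shared by both toy keys


def make_key(p: int, q: int):
    """Return (public modulus n, private exponent d) for textbook RSA."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))
    return n, d


def digest_int(message: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def sign(message: bytes, n: int, d: int) -> int:
    return pow(digest_int(message, n), d, n)


def verify(message: bytes, n: int, sig: int) -> bool:
    return pow(sig, E, n) == digest_int(message, n)


class SessionMonitor:
    def __init__(self, known_hosts: set, sess_id: bytes) -> None:
        self._known_hosts = known_hosts  # trusted public keys (moduli)
        self._sess_id = sess_id          # held inside the monitor, never caller-supplied

    def verify_sig(self, srvr_pub_key: int, sig: int) -> bool:
        # Untrusted code proposes a key, but only keys in known_hosts are accepted.
        if srvr_pub_key not in self._known_hosts:
            return False
        return verify(self._sess_id, srvr_pub_key, sig)


server_n, server_d = make_key(2**61 - 1, 2**89 - 1)
attacker_n, attacker_d = make_key(2**107 - 1, 2**127 - 1)
monitor = SessionMonitor({server_n}, b"current session id")

honest = monitor.verify_sig(server_n, sign(b"current session id", server_n, server_d))
# Route 1: attacker signs the real sess_ID with his own key and supplies that key.
own_key = monitor.verify_sig(attacker_n, sign(b"current session id", attacker_n, attacker_d))
# Route 2: attacker supplies a signature by the bona fide server over another sess_ID.
other_id = monitor.verify_sig(server_n, sign(b"another session id", server_n, server_d))
print(honest, own_key, other_id)
```

Only the honest case verifies: the first route fails the known_hosts check, and the second fails because the monitor checks the signature against its own sess_ID rather than one supplied by the caller.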

We now turn our attention to the next steps taken by 
the client. In the OpenSSH protocol, session key nego- 
tiation and server authentication, which establishes the 
user privilege barrier, are intertwined. Therefore, our par- 
titioning of OpenSSH needs no distinct pre-authenticated 
stage, and the SKN stage proceeds immediately to the 
post-authenticated stage. 


Post-authenticated Stage After computing symmet- 
ric keys and authenticating the server, the client kills 
the untrusted compartment from the SKN stage and cre- 
ates a new untrusted compartment, also confined to the 
ssh_nobody_t domain, to execute operations in the post- 
authenticated stage. This new compartment is granted
access to the session’s symmetric keys so that it can perform
encryption and decryption operations. It may invoke
privileged operations in the session monitor, and
the session monitor can in turn invoke privileged operations on
the client’s private keys in the private key monitor. To do
so, the private key monitor executes with the privilege to
read private key files.

In the post-authenticated stage, the server authenti- 
cates the client. Our prototype supports password and 
public key authentication. Password authentication does 
not require any further partitioning of the client to pro- 
tect against a malicious server, as the SSH protocol requires
that the client send the password to the server.
However, we can apply fine-grained privilege separation 
to deny the server access to the client’s private key(s). 
There is no need for the untrusted compartment to have 
direct access to the keys, and if it does, a malicious server 
that the user logs into may exploit the client and obtain its
private keys, and thus obtain sensitive information from 
other SSH servers where the user authenticates himself 
using the same private keys. Therefore, we isolate the 
client private keys from the post-authentication stage’s 
untrusted compartment by placing them in a privileged 
private key monitor. To prevent a private key signing oracle
in the client, we do not allow the untrusted compartment
to directly request signing of data of its own choice
using the private key. The untrusted compartment passes
untrusted input (user name, service name, authentication
mode, &c.) via the session key monitor. Note that we rely
on the session key monitor to supply the trusted session ID
computed earlier in the key exchange protocol to the private
key monitor, as shown in Figure 6. Recall that the
session ID has been entangled with trusted random bits
generated by the client for the current session. Thus, the
signature produced by the private key monitor will not
be valid in any session but the current one, and the private
key oracle has been eliminated.
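
The signing path can be sketched as below. This is an illustration only: an HMAC from Python's standard library stands in for the public-key signature, os.urandom stands in for the sess_ID produced by key exchange, and the class names, the encode helper, and request_signature are hypothetical.

```python
import hashlib
import hmac
import os


def encode(*fields: bytes) -> bytes:
    # Length-prefix each field so that field boundaries cannot be shifted.
    return b"".join(len(f).to_bytes(4, "big") + f for f in fields)


class PrivateKeyMonitor:
    def __init__(self) -> None:
        self._priv_key = os.urandom(32)  # stand-in for the user's private key

    def priv_key_sign(self, sess_id: bytes, user_name: bytes,
                      service: bytes, auth_mode: bytes) -> bytes:
        message = encode(sess_id, user_name, service, auth_mode)
        return hmac.new(self._priv_key, message, hashlib.sha256).digest()


class SessionMonitor:
    def __init__(self, pkey_monitor: PrivateKeyMonitor) -> None:
        self._pkey_monitor = pkey_monitor
        # Stand-in for the sess_ID computed during key exchange; it carries
        # fresh randomness and is never taken from untrusted code.
        self._sess_id = os.urandom(32)

    def request_signature(self, user_name: bytes, service: bytes,
                          auth_mode: bytes) -> bytes:
        # Untrusted code supplies only these three fields, never the sess_ID.
        return self._pkey_monitor.priv_key_sign(
            self._sess_id, user_name, service, auth_mode)


pkey_monitor = PrivateKeyMonitor()
session_a = SessionMonitor(pkey_monitor)
session_b = SessionMonitor(pkey_monitor)
sig_a = session_a.request_signature(b"alice", b"ssh-connection", b"publickey")
sig_b = session_b.request_signature(b"alice", b"ssh-connection", b"publickey")
print(sig_a != sig_b)
```

Identical untrusted inputs yield different signatures in different sessions, so a signature obtained in one session is of no use in another.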

To support session key rekeying, the unprivileged process
is permitted to invoke privileged rekeying operations
implemented by the session monitor. 


5.2 Hardened OpenSSH Server 


In accordance with Principle 9, we extend the baseline 
privilege-separated OpenSSH server with an extra ses- 
sion monitor process that handles sensitive data related 
to a single user’s session while preventing an SKD at- 
tack and both private key signing and signature verifi- 
cation oracles, as shown in Figure 7. The private key 
monitor is the original monitor process from the baseline 
privilege-separated OpenSSH server, which performs
operations that require root privilege. 

The session monitor, the unprivileged SKN process, 
and the unprivileged process of the pre-authentication 
stage all run in a chrooted environment with an unused 
Figure 7: Architecture of hardened OpenSSH server.


UID, under a restrictive SELinux policy that allows only 
the system calls implied in Figure 7, and prohibits all 
others, including dangerous ones such as ptrace and
connect. The process for the post-authenticated stage 
runs with the UID of the authenticated user and is not 
restricted with any SELinux policy, as with the baseline 
OpenSSH server. 


Session Key Negotiation Stage The session monitor 
implements the privileged operations required for the 
SKN stage, and we ensure that the pre-authenticated 
stage does not start unless the unprivileged compartment 
of the SKN stage terminates (in accordance with Princi- 
ple 2). Because the Diffie-Hellman key exchange proto- 
col is symmetric between the server and client, we implement
operations 1-5 from Figure 6 in the server’s session
monitor just as in the client’s. The SKD attack is
an equally serious threat for client and server; as both 
parties share the same session key, an SKD attacker can 
compromise either party’s code to disclose it. 
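
The symmetry of operations 1-5 can be seen in a toy sketch in which the same monitor class serves both parties. Everything here is an assumption made for illustration: the modulus and generator are toy parameters far too small for real use, the transcript encoding and key derivation are simplified rather than those of the SSH specification, and KexMonitor is a hypothetical name.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus (assumption); far too small for real use
G = 3           # toy generator (assumption)


class KexMonitor:
    """Toy session monitor covering operations 1-5 of Figure 6."""

    def __init__(self) -> None:
        self._dh_priv_key = secrets.randbelow(P - 3) + 2  # 1) gen_DH_priv_key
        self.dh_pub_key = pow(G, self._dh_priv_key, P)    # 2) comp_DH_pub_key
        self._sess_key = b""
        self._sess_id = b""

    def comp_sess_key(self, peer_dh_pub_key: int) -> None:  # 3)
        shared = pow(peer_dh_pub_key, self._dh_priv_key, P)
        self._sess_key = shared.to_bytes(16, "big")

    def comp_sess_id(self, transcript: bytes) -> None:      # 4)
        # The sess_ID depends on the secret, so it carries this party's randomness.
        self._sess_id = hashlib.sha256(transcript + self._sess_key).digest()

    def derive_sym_keys(self) -> dict:                       # 5)
        def kdf(label: bytes) -> bytes:
            return hashlib.sha256(self._sess_key + self._sess_id + label).digest()
        return {"enc": kdf(b"enc"), "mac": kdf(b"mac")}


client, server = KexMonitor(), KexMonitor()
client.comp_sess_key(server.dh_pub_key)
server.comp_sess_key(client.dh_pub_key)
transcript = b"clnt_version|srvr_version|clnt_kexinit|srvr_kexinit"
client.comp_sess_id(transcript)
server.comp_sess_id(transcript)
same_keys = client.derive_sym_keys() == server.derive_sym_keys()
print(same_keys)
```

Only the public DH values cross the compartment boundary; the private key, session key, and symmetric keys stay inside each monitor, which is why compromising either side's unprivileged code does not by itself disclose them.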

During the SKN stage, the server authenticates itself 
to the client by signing a session ID. The monitor in the 
baseline privilege-separated OpenSSH server signs any 
data supplied by the untrusted compartment, thus allow- 
ing an oracle attack. A man-in-the-middle attacker can 
interpose himself between a client and a bona fide server 
and employ a signing oracle on the server to impersonate 
the server by producing valid signatures on session IDs 
corresponding to the attacker’s session with the client. 
We prevent such attacks by constraining the private key 
monitor to sign only data provided by the trusted session 
monitor—specifically, the current session ID entangled 
with trusted random bits provided by the server, as shown 
in Figure 4, as suggested by Principle 6. The server’s ses- 
sion monitor produces this sess_ID in operation 4 in Fig- 
ure 6, just as the client’s does. This signed sess_ID can- 
not be used to impersonate the server as it is only valid 
within the current session. To perform the signing operation,
the session monitor calls into the privileged private
key monitor and supplies the required trusted sess_ID to 
sign. 


Pre-authenticated and Post-authenticated Stages 
The baseline privilege-separated OpenSSH server sepa- 
rates the pre-authenticated and post-authenticated stages. 
It performs user authentication operations such as pass- 
word verification and signature validation (in public key 
authentication) in the monitor. However, this architec- 
ture allows an SKD attacker to compromise the password 
during password authentication, as it is encrypted with 
the session key obtainable by the attacker. During public 
key authentication, the untrusted compartment supplies 
the data used for user signature verification, again allow- 
ing oracle attacks against user authentication. The mon- 
itor validates the signature against the session ID sup- 
plied earlier when the untrusted compartment requested 
the server’s signature on this session ID. Thus the un- 
trusted compartment can control the session ID used in 
public key authentication of the user. In order for an at- 
tacker to impersonate the client, she must provide some 
session ID signed by the client for the server’s verifica- 
tion operation. It is unlikely that the attacker can force a 
user to sign arbitrary data with his private key. However, 
an SKD attacker can compromise the user’s session and 
log its session ID and signature pair. She can then replay 
these data to the server’s signature verification compart- 
ment. Because the server’s signature verification routine 
does not check whether the provided session ID is valid 
within the current session, the verification routine will re- 
port that the client has authenticated successfully. In this 
way, the attacker successfully impersonates the user. 

In our implementation, we fix this problem by making 
sure that the session ID used for signature verification is 
produced by the session monitor, as done in operation 4 
in Figure 6, and entangled with trusted random bits pro- 
vided by the server. Our SKN stage also ensures the se- 
crecy of user passwords by thwarting SKD attacks. 


Discussion: Trusted Code Base Figure 8 compares 
the trusted code bases of Provos et al.’s baseline 
privilege-separated OpenSSH server and our hardened 
OpenSSH server. The latter implements two monitors, 
in accordance with Principle 9, and as described in Fig- 
ure 7: one private key monitor that implements code re- 
quired for user authentication and accessing the server’s 
private key, and one session key monitor that contains 
the privileged code for processing the sensitive state for 
a user’s session. Consider operations 1-5 in Figure 6, 
which are essential to protection against SKD and oracle 
attacks. In our partitioning, the session monitor imple- 
ments these five operations, while in baseline OpenSSH, 
the untrusted compartment implements them. 
At first glance, one might remark that our partitioning
therefore incorporates more privileged code than base- 
line OpenSSH. But that assessment is flawed. Rather, the 
sensitive state pertaining to a user’s session was incor- 
rectly deemed non-sensitive data in baseline OpenSSH. 
Hence, we show baseline OpenSSH’s untrusted pro- 
cess as shaded—notation for privileged—because it is 
already (albeit inappropriately) privileged to manipu- 
late sensitive per-session data. Following the partitioning 
principles we have offered leads to the correct treatment 
of this data as sensitive, the creation of a new privileged 
compartment that can exclusively manipulate this data 
(the session monitor), and the reduction of privilege for 
all remaining code from baseline OpenSSH’s untrusted 
process (denoted in the figure as “unprivileged code”).


6 Hardened OpenSSL Library 


Toward demonstrating the generality of the partitioning 
principles presented in Section 4, we have also applied 
them to the SSLv3 and TLSv1 cryptographic protocol 
implementations in the OpenSSL library. As partition- 
ing in accordance with these principles requires a fair 
amount of programmer effort, we found the OpenSSL 
library a particularly attractive target; hardening the 
library allows amortizing one partitioning effort over 
a broad range of security-conscious applications. The 
resulting hardened OpenSSL library is a drop-in re- 
placement that renders any SSL/TLS application linked 
against it immune to SKD and oracle attacks. We note, 
however, that changing the library alone cannot ensure 
that the application atop the library itself handles sensi- 
tive data securely. For example, the Apache web server 
reuses worker processes across requests submitted by 
different users. If an attacker exploits a worker process, 
he may be able to obtain sensitive data belonging to the 
next user whose request is handled by that process. 

We finely partition both the client and server
sides of OpenSSL. Our implementation supports RSA, 
ephemeral RSA, Diffie-Hellman, and ephemeral Diffie-
Hellman key exchange, client and server authentication,
and session caching. The OpenSSL partitioning is in fact 
similar in structure to that of SSH, as these protocols 
protect against similar threat models. When an applica- 
tion invokes SSL_accept (on the server) or SSL_connect 
(on the client), we instantiate private key monitor, session 
key monitor, and unprivileged SKN compartments. Our 
implementation scrubs the server’s private key from the 
session key monitor process and the unprivileged SKN 
compartment before reading any input from the network. 
Within the SKN stage, we apply the same principles and 
mechanisms as we did to OpenSSH to prevent SKD and 
oracle attacks. As SSL/TLS supports only public key au- 
thentication, its partitioning omits the pre-authentication 
stage. We apply simple SELinux policies (whose details 
we elide in the interest of brevity) to limit the privilege of 
the untrusted SKN compartment and the session monitor 
in applications that do not run as root. When the SKN 
stage completes, the unprivileged compartment and ses- 
sion monitor are terminated, and execution continues in 
the application’s fully privileged compartment. The pri- 
vate key monitor preserves the privileges of the appli- 
cation before entering the SSL_accept and SSL_connect 
library calls. Therefore, this compartment continues exe- 
cution of the application’s code and can use the symmet- 
ric key computed during the SSL handshake to perform 
MAC and encryption/decryption operations on the estab- 
lished SSL/TLS session. 


We have tested this hardened OpenSSL library with 
a number of client-side and server-side applications, 
including the server and client sides of stunnel, the 
mutt and mailx mail agents (for IMAP and POP3 over 
SSL/TLS), the dovecot IMAP and POP3 server, the 
client and server sides of the sendmail mail transfer agent 
(for SMTP over SSL/TLS), and the Apache HTTPS 
server (versions 1.3.19 and 2.2.14). 


Converting most of these applications was straight- 
forward; it merely required replacing the OpenSSL li- 
brary and making a one-line change to the application’s 
SELinux policy, without any application code modifica- 
tions. Apache, however, required code modifications— 
not to protect against SKD and oracle attacks, which the 
partitioned OpenSSL library defends against, but to pro- 
tect sensitive data after the SSL handshake completes. As 
noted above, Apache reuses worker processes to serve 
successive users’ requests. We modified Apache to en- 
force inter-user isolation: to ensure that an attacker’s ex- 
ploit of a worker cannot disclose the sensitive data of 
the next user to connect to the same worker. We com- 
pare two implementations of this isolation. The first is a 
naive one in which Apache kills a worker after it serves 
one request and forks another to replace it. As the overhead
of fork is significant, we compare against an optimized
implementation based on checkpoint-restore, as
Figure 9: Latency of operations in OpenSSH 5.2p1 client/server, mailx
12.4, dovecot 1.2.10, and sendmail client 8.14.4 using baseline and 
hardened OpenSSL 0.9.8k library. Run on Dell desktop with 1.86 GHz
Intel Core 2 6300 CPU and 1 GB RAM running Linux 2.6.30. 
Figure 10: Throughput of sendmail server 8.14.4 and indicated combination
of Apache web server (httpd) 2.2.14 with OpenSSL 0.9.8k library.
Run on Sun X4100 server with 2.2 GHz AMD Opteron 248 CPU
and 2 GB RAM under Linux 2.6.32. 


proposed by Bittau [1]. In this approach, Apache takes 
a snapshot of each new worker process’s pristine mem- 
ory image before it serves any requests, and after each 
request, a trusted monitor process restores the worker’s 
memory image to this pristine state. 
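
The snapshot-and-restore idea can be modeled in a few lines. This is a toy analogy only: a Python dictionary stands in for a worker's memory image and copy.deepcopy stands in for the checkpoint and restore primitives, which in the approach described above operate on process memory. The Worker and RestoreMonitor names are hypothetical.

```python
import copy


class Worker:
    """Toy worker whose 'memory' is a dictionary."""

    def __init__(self) -> None:
        self.memory = {"config": "loaded", "scratch": []}

    def serve(self, user: str, secret: str) -> str:
        # Serving a request leaves the user's sensitive data in memory.
        self.memory["scratch"].append((user, secret))
        return f"served {user}"


class RestoreMonitor:
    """Trusted monitor that resets the worker after every request."""

    def __init__(self, worker: Worker) -> None:
        self._worker = worker
        # Snapshot taken before the worker has served any request.
        self._pristine = copy.deepcopy(worker.memory)

    def handle(self, user: str, secret: str) -> str:
        try:
            return self._worker.serve(user, secret)
        finally:
            # Restore the pristine image, even if serving raised an error.
            self._worker.memory = copy.deepcopy(self._pristine)


worker = Worker()
monitor = RestoreMonitor(worker)
reply = monitor.handle("alice", "alice-password")
leftover = worker.memory["scratch"]
print(reply, leftover)
```

After the restore, nothing from the first user's request remains for a later exploit of the same worker to find.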

With or without this unrelated application-level 
change, Apache 1.3.19 and 2.2.14 run with the hardened 
OpenSSL library as a drop-in replacement for the stock 
OpenSSL library. 


7 Evaluation 


We now consider the cost of defending against SKD 
and oracle attacks in cryptographic protocol implemen- 
tations. As the principles given in Section 4 demand ad- 
ditional isolation between code and data, and thus addi- 
tional processes, performance is a concern: both process 
creation and context switches incur overhead. To explore 
the extent of these overheads, we compare the perfor- 
mance of the baseline OpenSSH and OpenSSL-enabled 
applications with that of the implementations hardened 
in accordance with the principles we have propounded. 
We consider in turn the end-to-end metrics of operation 
latency (important to users) and server-side throughput
(important to server operators). 

Figure 9 compares operation latencies for a range 
of applications. Each application is either client-side or 
server-side; in each case, the complementary remote peer 
runs the baseline cryptographic protocol implementa- 
tion. All connections are made over the loopback inter- 
face to a locally running server. For OpenSSH, we report 
the latency of logging into an SSH server using public 
key authentication and running the exit command. The 
remaining applications use the OpenSSL library. For the 
mailx email client and dovecot IMAP server, we measure 
the time required for the client to connect over SSL/TLS, 
check for new mail, and exit. For the sendmail client, we 
measure the time required to connect and send a one-line 
email to a sendmail server over SSL/TLS. For these ap- 
plications, the latency a user perceives does not increase 
significantly between the baseline and hardened crypto- 
graphic protocol implementations. 

In Figure 10, we consider the throughput achieved by 
an SSL/TLS-enabled sendmail server and HTTPS server, 
both based on the OpenSSL library. For the sendmail 
server, we submit emails over SSL/TLS from multiple 
clients and report the maximum load the server can sus- 
tain in requests (emails) per second. Introducing oracle 
and SKD defenses into the OpenSSL library negligibly
affects the sendmail server’s throughput. 

To determine the maximum load the Apache (httpd) 
web server can sustain, we increase the number of clients 
requesting a small static page over HTTPS until the num- 
ber of requests served per second reaches a maximum. 
Clients make new SSL/TLS connections for each re- 
quest. As noted in Section 6, apart from adding defenses 
against SKD and oracle attacks, we further modified the 
baseline Apache implementation to isolate users who 
successively connect to the same worker from one an- 
other. To distinguish the cost of inter-user isolation from 
that of defending against SKD and oracle attacks, we 
measure the throughput of several Apache implemen- 
tations: baseline Apache, in which workers are reused 
across requests, so users are not mutually isolated; a 
hardened Apache with inter-user isolation implemented 
with one fork per request, without oracle or SKD attack 
defenses; and a hardened Apache with inter-user isola- 
tion implemented with three forks per request, with 
oracle and SKD attack defenses. To explore the role 
of isolation primitives in performance, we also imple- 
mented versions of hardened Apache that use optimized 
checkpoint-restore primitives [1] rather than fork. We 
further consider Apache’s performance in two extremes 
of operation: when no SSL sessions are cached and when 
all sessions are cached. We configure HTTPS clients to 
use RSA key exchange when establishing an SSL/TLS 
session because this protocol is less computationally intensive
for the server than ephemeral Diffie-Hellman key
exchange, and thus better exposes the overhead of hard- 
ening. 

Returning to Figure 10, let us first consider the work- 
load in which no SSL/TLS sessions are cached, running 
on the hardened versions of Apache implemented using 
checkpoint-restore. End-to-end, the version of Apache 
providing both inter-user isolation and defenses from or- 
acle and SKD attacks achieves more than half (55%) the 
throughput of baseline Apache, which provides none of 
these security benefits. The overhead of these security 
mechanisms is masked in part by the computational costs 
of the cryptographic operations required to establish a 
new SSL/TLS session. We note that this “fully” hardened 
version of Apache achieves over 70% the throughput of 
one that provides inter-user isolation with checkpoint- 
restore but omits oracle and SKD attack defenses—so 
for this workload using these isolation primitives, oracle 
and SKD attack defenses incur only moderate overhead. 

In the workload in which all SSL/TLS sessions are
cached, there are no public-key cryptographic opera- 
tions, so the overheads of inter-user isolation and oracle 
and SKD attack defenses are more exposed. Focusing on 
the implementations built on checkpoint-restore, Apache 
with inter-user isolation (but without oracle/SKD de- 
fenses) achieves 60% of the throughput of baseline 
Apache; this reduction is the cost of inter-user isola- 
tion. Adding oracle and SKD defenses to the inter-user- 
isolated implementation further reduces throughput by 
60%; that is the incremental cost of oracle and SKD de- 
fenses on this challenging workload. End-to-end, this last 
version of Apache, which incorporates all defenses and 
inter-user isolation, achieves only about one quarter of 
the throughput of baseline Apache (which lacks any of 
these security enhancements). We stress that while this 
throughput reduction is significant, it represents atypi- 
cally worst-case behavior: all sessions cached (never the 
case) and static content. On servers that distribute dy- 
namically generated content, the overhead of protecting 
users’ sensitive data will be amortized over far more ap- 
plication computation. 
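The two cached-workload reductions quoted above compose multiplicatively, which is where the "about one quarter" figure comes from. The following is a minimal arithmetic check, assuming only the rounded percentages stated in the text (the function and variable names are illustrative, not taken from the measurement scripts):

```python
# Relative throughputs for the all-sessions-cached workload, as quoted in the text.
ISOLATION_ONLY = 0.60      # inter-user isolation only, relative to baseline Apache
DEFENSE_REDUCTION = 0.60   # further reduction when oracle/SKD defenses are added


def end_to_end(isolation_fraction: float, further_reduction: float) -> float:
    """Fraction of baseline throughput left after applying both mechanisms."""
    return isolation_fraction * (1.0 - further_reduction)


full = end_to_end(ISOLATION_ONLY, DEFENSE_REDUCTION)
print(f"fully hardened / baseline = {full:.2f}")  # 0.60 * 0.40
```

Because the inputs are rounded, the result should be read as an approximation of the measured end-to-end ratio rather than as an exact value.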

The original applications based on the OpenSSL li- 
brary used single-process, monolithic designs. Harden- 
ing against SKD and oracle attacks requires three pro- 
cesses per SSL/TLS session: a private key monitor, a 
session monitor, and an unprivileged compartment for 
the SKN stage. Similarly, the hardened OpenSSH server 
and client use four processes per SSH session vs. the two 
employed by the baseline privilege-separated OpenSSH 
server. Apart from the process creation and page fault 
costs associated with fork and the memory copy costs 
associated with checkpoint-restore, anti-SKD and anti- 
oracle hardening incur overhead for additional context 
switches and the marshaling and unmarshaling of
arguments and return values between compartments
connected by pipes.

Again for the uncached workload, consider the 
throughput achieved by the full checkpoint-restore ver- 
sion of Apache (all defenses) vs. that achieved by one 
with the same full set of defenses, but implemented 
naively with fork. Checkpoint-restore offers a 20% 
throughput improvement over fork. While the end-to- 
end cost of inter-user isolation and oracle and SKD de- 
fenses is significant, the design of the underlying prim- 
itives used to implement compartments, though beyond 
the scope of this paper, appears to play a significant role 
in determining end-to-end performance. 


8 Related Work


Provos et al. describe privilege separation, which de- 
nies enhanced system privileges to unauthorized attack- 
ers who exploit an application [9]. They reduce privilege 
in the OpenSSH server by partitioning it into an untrusted 
process and a privileged monitor. Our work tackles the 
different goal of preventing disclosure of users’ sensitive 
data in cryptographic protocol implementations. This 
goal incorporates preventing privilege escalation. We ex- 
tend the partitioning of the privilege-separated OpenSSH 
server to comply with this goal. 

OKWS is a toolkit for building secure Web ser- 
vices [5]. It employs similar privilege enforcement mech- 
anisms as privilege-separated OpenSSH—processes, the 
nobody user ID, and the chroot system call—to iso- 
late distrusted Web services from the system they are 
running on and each other. Our complementary goal has 
been to protect sensitive data by hardening cryptographic 
protocol implementations against exploit. 

HiStar [14] enforces privileges on compartments with 
labels and DIFC. DStar [15] extends this approach to a 
distributed environment without fully trusted machines. 
Zeldovich et al. partition an SSL server to mitigate the 
effect of a compromise of any single compartment and 
prevent disclosure of user data. However, as we have de- 
scribed, it is possible to disclose users’ sensitive data 
from the SSL server using SKD and oracle attacks. 
The insufficient partitioning of the SSL protocol allows 
these attacks. Our work is complementary to work on 
DIFC systems: they are privilege-enforcement mecha- 
nisms, while we provide guidance on how to structure 
code for cryptographic protocols. 

We first discovered an instance of the attack we have 
generalized in this paper as the SKD attack during prior 
work with colleagues on Wedge [2], a set of primitives 
and tools for fine-grained partitioning of applications on 
Linux. While we presented an ad hoc defense for one 
narrow instance of the attack in that work, we offered no 
general characterization of it nor solution to it. By con- 
trast, in this paper, we offer design principles that defeat 
the SKD and oracle attacks and that we believe are
general enough to apply to many cryptographic protocols.

The partitioning principles and attack mitigation tech- 
niques we have offered might also find fruitful use 
in capability-based systems such as KeyKOS [3] and 
EROS [11]. While capabilities provide convenient means 
to restrict privileges, programmers need guidance in how 
to apply them to protect sensitive data. 


9 Conclusion and Future Work 


We have described two practical exploit-based attacks 
on cryptographic protocol implementations, the session 
key disclosure (SKD) attack and oracle attack, that can 
disclose users’ sensitive data, even in state-of-the-art, 
reduced-privilege applications such as the OpenSSH 
server and HiStar-labeled SSL web server. Privilege sep- 
aration and DIFC will not secure the user’s sensitive 
data against these attacks unless an application has been 
specifically structured to thwart them. 

The principles we have offered guide programmers in 
partitioning cryptographic protocol implementations to 
defend against SKD and oracle attacks. In essence, fol- 
lowing these principles reduces the trusted code base of 
an application by correctly treating session key mate- 
rial and oracle-prone functions as sensitive, and limiting 
privilege accordingly. 

To demonstrate that these principles are practical, we 
newly partitioned an OpenSSH client and extended the 
partitioning of a privilege-separated OpenSSH server. 
Further experience with the OpenSSL library suggests 
they may generalize to other cryptographic protocols; 
they are broadly targeted at protocols that negotiate ses- 
sion keys and perform common cryptographic opera- 
tions. While we hope these principles will serve as a 
useful guide where there was none, we note that their 
application requires careful programmer effort. Still, our 
experience with OpenSSL shows that hardening a library 
once brings robustness against these attacks to the several 
applications that reuse that library. 

The latency cost of defending against SKD and ora- 
cle attacks is well within user tolerances for all applica- 
tions we measured. Defending against SKD and oracle 
attacks does exact a cost in throughput on a busy SSL- 
enabled Apache server, however, reducing the uncached 
SSL/TLS session handshake rate of a server that isolates 
users by just under 30%, and the cached rate by 60%. 
While that cost is significant, as our comparison of fork 
and checkpoint-restore demonstrates, it depends heavily 
on the performance of underlying isolation primitives—a 
topic we believe merits further investigation. 

Finally, while we have relied upon manual study of the 
SSH and SSL/TLS protocols and their implementations 
to discover the attacks we have presented, we intend to 
explore tools that use static and dynamic analysis to ease 
discovery of such vulnerabilities in cryptographic
protocol implementations.
Notes


1 While we did not implement these two attacks, we present analysis
of the protocols and implementations demonstrating they are possible.

2 While space limits us to illustrating these attacks and defense
principles in the context of SSH and SSL/TLS, we have found they apply
equally to IPSec, CRAM-MD5, and other secure protocols.

3 Alternatives to SELinux include limiting a process’s privileges
with Systrace [8], ptrace, and chroot (though the latter requires
making a client application setuid root).
Abstract

Current Electronic Toll Pricing (ETP) implementations
rely on on-board units sending fine-grained location
data to the service provider. We present PrETP, a
privacy-preserving ETP system in which on-board units
can prove that they use genuine data and perform correct
operations while disclosing the minimum amount of
location data. PrETP employs a cryptographic protocol,
Optimistic Payment, which we define in the
ideal-world/real-world paradigm, construct, and prove secure
under standard assumptions. We provide an efficient
implementation of this construction and build an on-board
unit on an embedded microcontroller which is, to the best
of our knowledge, the first self-contained prototype that
supports remote auditing. We thoroughly analyze our
system from a security, legal and performance perspective
and demonstrate that PrETP is suitable for low-cost
commercial applications.


1 Introduction 


Vehicular location-based technologies [36, 42] are 
viewed by governments as a perfect tool to support ap- 
plications such as electronic toll collection, automated 
law enforcement, or collection of traffic statistics. In Oc- 
tober 2009, the European Commission announced that 
the current flat road tax systems existing in the Member 
States will be substituted by a European Electronic Toll
Service (EETS) [13, 20]. In the United States, there are 
also ongoing initiatives to introduce Electronic Toll Pric- 
ing (ETP), as for instance the Regional High Occupancy 
Toll Network of the California Metropolitan Transporta- 
tion Commission [1]. 

ETP allows road taxes to be calculated depending on 
parameters such as the distance covered by a driver, the 
kind of road used, or the time of usage. This is benefi- 
cial both for citizens and governments. The former pay 
only for their actual road use, while the latter can im- 
prove road mobility by applying “congestion pricing”. 
This strategy assigns prices to roads depending on their
traffic density such that driving in congested roads im- 
plies a higher cost. This in turn will encourage users to 
change their route (or even avoid using their vehicles) 
thus reducing congestion. Moreover, ETP also has
environmental benefits, as it discourages driving and hence
reduces pollution.


ETP architectures proposed so far [1, 13, 20] require 
that vehicles are equipped with an on-board unit neces- 
sary for collecting location data. At the end of each tax 
period, the fee corresponding to those data is computed 
either remotely [36, 42] or locally [44], and relayed to 
the service provider. In both cases the service provider 
needs to be convinced that the fees correspond to the ac- 
tual road usage of the driver, and that they have been 
correctly calculated. The verification is straightforward 
in implementations in which all the location data is sent 
to the service provider, but this constitutes an inherent 
threat to users’ privacy. 

We propose PrETP, a privacy-preserving ETP system
in which, without making impractical assumptions,
on-board units i) compute the fee locally, and ii) prove
to the service provider that they carry out correct
computations while revealing the minimum amount of
location data. PrETP employs a cryptographic protocol,
Optimistic Payment (OP), in which on-board units send 
along with the final fee commitments to the locations and 
prices used in the fee computation. These commitments 
do not reveal information on the locations or prices to the 
service provider. Moreover, they ensure that drivers can- 
not claim that they were at any other position, nor used 
different prices, from the ones used to create the commit- 
ments. In order to check the veracity of the committed 
values, we rely on the service provider having access to 
a proof (e.g., a photograph taken by a road-side radar or 
a toll gate) that a car was at a specific point at a par- 
ticular time, as previously suggested in [17, 39]. Upon 
being challenged with this proof, the on-board unit must 
respond with some information proving that the location 
point where it was spotted was correctly used in the
calculation of the final fee. To this end, it opens the com-
mitment containing this location, thus revealing only the 
location data and the price at the instant specified in the 
proof. This information suffices for the provider to ver- 
ify that correct input data (location and price) was used 
to calculate the fee. 

We formally define Optimistic Payment and propose 
a construction based on homomorphic commitments 
and signature schemes that allow for efficient zero- 
knowledge proofs of signature possession. We prove 
our construction secure under standard assumptions. Fi- 
nally, we present a prototype implementation on an em- 
bedded platform, and demonstrate that the cryptographic 
overhead of Optimistic Payment is efficient enough to 
be practically deployed in commercial in-car devices. 
Further, the fact that on-board units carry out all oper- 
ations without interaction with the driver makes our sys- 
tem ideal in terms of usability. 

The rest of the paper is organized as follows: we de- 
scribe our system models and the security properties we 
seek in Sect. 2. Sect. 3 presents a high level description 
of our construction. Our prototype implementation and 
its evaluation are presented in Sect. 4, and we discuss 
some practical issues in Sect. 5. We situate our work 
within the landscape of proposals for privacy-friendly 
vehicular applications in Sect. 6, and we conclude in 
Sect. 7. Finally, we define the concept of Optimistic Pay- 
ment in Appendix A, and describe in detail our crypto- 
graphic construction in Appendix B. 


2 System model 


PrETP employs the architecture and technologies
recommended at European level [13, 20], although it could
be adapted to other systems, such as [1]. The system 
model, illustrated in Fig. 1 (left), comprises three enti- 
ties: an On-Board Unit (OBU), a Toll Service Provider 
(TSP), and a Toll Charger (TC). The OBU is an elec- 
tronic device installed in vehicles subscribed to an ETP 
service, and it is in charge of collecting GPS data and cal- 
culating the fee at the end of each tax period. The TSP is 
the entity that offers the ETP service. It is responsible for 
providing vehicles with OBUs and monitoring their
performance and integrity. Finally, the TC is the organization
(either public or private) that levies tolls for the use of 
roads and defines the correct use of the system. In agree- 
ment with the TC, the TSP establishes prices for driving 
on each of the roads. Such pricing policy can depend on 
the type of road (e.g., highways vs. secondary roads), its 
traffic density, or the time of the day (e.g., rush hours 
vs. the middle of the night). Additionally, prices can 
also depend on attributes of the vehicle or of the driver 
(e.g., low-pollution vehicles, or discounts for retired
people). For the sake of clarity, in this work we focus on the
core functionality of PrETP, and defer the discussion of
practical issues to Sect. 5.

When the vehicle is driving, the OBU calculates the 
subfees corresponding to the trajectories according to the 
TSP pricing policy. At the end of each tax period, the 
OBU aggregates all the subfees to obtain a total fee and 
sends it to the TSP. This process safeguards the pri- 
vacy of the driver from the TSP, the TC, or any other 
third party eavesdropping the communications, as no lo- 
cation data leaves the OBU. The privacy objectives of 
PrETP focus on the limitation of deliberate surveillance
by any external party with limited access to the vehicle. 
We note that for an adversary with physical access to the 
vehicle it would be trivial to track it, e.g. by installing 
a tracking device. In order to further protect the privacy 
of users from adversaries that have occasional access to 
OBUs (e.g., mechanic, valet), all location data stored in 
the OBU is securely encrypted as specified in [44]. 

Besides preserving users’ privacy, the system has to 
protect the interests of both TC and TSP and provide 
means to prevent users from committing fraud. Our 
threat model considers malicious drivers capable of tam- 
pering with the internal functionality of the OBU, as well 
as with any of its interfaces. Under these considerations, 
we define the security goals of our system as the detec- 
tion of: 


Vehicles with inactive OBUs. Drivers should not be 
able to shut down their OBUs at will to simulate they 
drove less. 

OBUs reporting false GPS location data. Drivers 
should not be able to spoof the GPS signal and simulate 
a cheaper route than the actual roads on which they are 
driving. 

OBUs using incorrect road prices. Drivers should 
not be able to assign arbitrary prices to the roads on 
which they are driving. 

OBUs reporting false final fees. Drivers should not 
be able to report an arbitrary fee, but only the result from 
the correct calculations in the OBU. 


Focusing on the detection of tampering rather than on
its prevention allows us to consider a very simple OBU
with no trusted components, reducing the production 
costs of the device. 

Figure 1: Entities in our Electronic Toll Pricing architecture (left). Enforcement spot-check model (right).

In order to perform this detection, reliable information
about the vehicle’s whereabouts is required. We consider
that the TC can perform random “spot checks” that are
recorded as proof of the time and location where a vehicle
has been seen. Such spot checks can be carried out by
using an automatic license plate reader, a police control,
or even challenging the OBUs using Dedicated Short-Range
Communications (DSRC) [13]. Without loss of
generality, in this work we assume that the proof is gathered
using an automatic license plate reader. This proof
can be used to challenge the vehicle’s OBU to verify its 
functioning. In order to be able to respond to this chal- 
lenge, the OBU slices the trajectories recorded in seg- 
ments, and computes the subfees corresponding to them, 
such that these subfees add up to the final fee transmit- 
ted to the TSP. For each segment, the TSP receives a 
payment tuple that consists of a commitment to location 
data and time, a homomorphic commitment to the sub- 
fee, and a proof that the committed subfee is computed 
according to the policy. These payment tuples, explained 
in detail in the next section, bind the reported final fee 
to the committed values such that the OBU cannot claim 
having used other locations or prices in its computations. 
Furthermore, they are signed by the OBU to prevent a 
malicious TSP from framing an honest driver. 


The verification process, depicted in Fig. 1 (right), is 
initiated when the TC gathers a proof of location of a 
vehicle. Then it forwards this information to the TSP, 
along with a request to check the correct functioning of 
the vehicle’s OBU. To this end, the TSP challenges the 
OBU to open a commitment containing the location and 
time appearing in the proof. The TSP verifies that both 
challenge and response match, for instance as explained 
in [39], and reports to the TC whether or not the
functioning of the OBU is correct. We assume that the TC
(e.g., the government in the EETS architecture) is honest 
and does not use fake proofs to challenge OBUs. 


3 Optimistic Payment 


In this section we sketch the technical concepts neces- 
sary to understand the construction of Optimistic Pay- 
ment, and we outline our efficient implementation of the 
protocol. For a comprehensive and more formal descrip- 
tion of OP, we refer the reader to Appendix B. 
3.1 Technical Preliminaries


Signature Schemes. A signature scheme consists
of the algorithms SigKeygen, SigSign and SigVerify.
SigKeygen outputs a secret key $sk$ and a public key
$pk$. SigSign$(sk, x)$ outputs a signature $s_x$ of message $x$.
SigVerify$(pk, x, s_x)$ outputs accept if $s_x$ is a valid
signature of $x$ and reject otherwise. A signature scheme must
be correct and unforgeable [26]. Informally speaking,
correctness implies that the SigVerify algorithm always
accepts an honestly generated signature. Unforgeability
means that no p.p.t. adversary should be able to output a
message-signature pair $(x, s_x)$ unless he has previously
obtained a signature on $x$.


Commitment schemes. A non-interactive commitment
scheme consists of the algorithms ComSetup,
Commit and Open. ComSetup$(1^k)$ generates the
parameters of the commitment scheme $params_{Com}$.
Commit$(params_{Com}, x)$ outputs a commitment $c_x$ to
$x$ and auxiliary information $open_x$. A commitment is
opened by revealing $(x, open_x)$ and checking whether
Open$(params_{Com}, c_x, x, open_x)$ is true. A commitment
scheme has a hiding property and a binding property.
Informally speaking, the hiding property ensures that
a commitment $c_x$ to $x$ does not reveal any information
about $x$, whereas the binding property ensures that
$c_x$ cannot be opened to another value $x'$. Given two
commitments $c_{x_1}$ and $c_{x_2}$ with openings $(x_1, open_{x_1})$
and $(x_2, open_{x_2})$ respectively, the additively homomorphic
property ensures that, if $c = c_{x_1} \cdot c_{x_2}$, then
Open$(params_{Com}, c, x_1 + x_2, open_{x_1} + open_{x_2})$ is true.
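The additively homomorphic property can be illustrated with a small sketch. The code below is a toy commitment of the form $c_x = g_0^x g_1^{open_x} \bmod n$; the modulus and bases are assumptions chosen only so that the example runs, and they are far too small (and too carelessly chosen) to offer any hiding or binding in practice.

```python
import secrets
from typing import Optional, Tuple

# Toy public parameters (illustrative assumptions, NOT secure).
N = 3233        # 61 * 53, standing in for an RSA modulus
G0, G1 = 4, 9   # two bases coprime to N


def commit(x: int, opening: Optional[int] = None) -> Tuple[int, int]:
    """Return (c_x, open_x) with c_x = G0^x * G1^open_x mod N."""
    if opening is None:
        opening = secrets.randbelow(N)
    return (pow(G0, x, N) * pow(G1, opening, N)) % N, opening


def open_commitment(c: int, x: int, opening: int) -> bool:
    """Accept iff (x, opening) is a valid opening of c."""
    return c == (pow(G0, x, N) * pow(G1, opening, N)) % N


c1, o1 = commit(12)
c2, o2 = commit(30)
c = (c1 * c2) % N  # multiplying commitments adds the committed values

print(open_commitment(c, 12 + 30, o1 + o2))
```

Multiplying the two commitments yields a commitment to the sum of the values under the sum of the openings, which is the property the fee aggregation relies on.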


Proofs of Knowledge. A zero-knowledge proof of
knowledge is a two-party protocol between a prover and
a verifier. The prover proves to the verifier knowledge
of some secret values that fulfill some statement without
disclosing the secret values to the verifier. For instance,
let $x$ be the secret key of a public key $y = g^x$, and let
the prover know $(x, g, y)$, while the verifier only knows
$(g, y)$. By means of a proof of knowledge, the prover can
convince the verifier that he knows $x$ such that $y = g^x$,
without revealing any information about $x$.
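For the discrete-logarithm example above, one standard way to realize such a proof is Schnorr's protocol, made non-interactive with the Fiat-Shamir heuristic. The sketch below is not the proof system used in the construction (which proves possession of a CL-RSA signature); it only illustrates the idea for $y = g^x$. The group parameters are toy assumptions and far too small to be secure.

```python
import hashlib
import secrets
from typing import Tuple

# Toy group (assumption): P = 2*Q + 1 with P and Q prime; G has order Q modulo P.
P, Q, G = 2039, 1019, 4


def challenge(y: int, t: int) -> int:
    """Fiat-Shamir challenge derived from the public values."""
    data = f"{G}|{y}|{t}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def prove(x: int) -> Tuple[int, int, int]:
    """Prover: show knowledge of x with y = G^x mod P without sending x."""
    y = pow(G, x, P)
    r = secrets.randbelow(Q)   # fresh randomness hides x in the response
    t = pow(G, r, P)           # prover's commitment
    c = challenge(y, t)
    s = (r + c * x) % Q        # response
    return y, t, s


def verify(y: int, t: int, s: int) -> bool:
    """Verifier: check G^s == t * y^c mod P, using only public values."""
    c = challenge(y, t)
    return pow(G, s, P) == (t * pow(y, c, P)) % P


y, t, s = prove(123)
print(verify(y, t, s))
```

The verifier sees only $(y, t, s)$; the check succeeds because $g^s = g^{r + cx} = t \cdot y^c$ in the subgroup of order $Q$.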


3.2 Intuition Behind Our Construction 


We consider a setting with the entities presented in
Sect. 2. During each tax period $tag$, the OBU slices
the trajectories of the driver into segments, each formed by a
structure containing GPS location data and time.
Additionally, this data structure can contain information about
any other parameter that influences the price to be paid
for driving on the segment. We represent this data
structure as a tuple $(loc, time)$. The TSP establishes a
function $f : (loc, time) \to \Upsilon$ that maps every possible tuple
$(loc, time)$ to a price $p \in \Upsilon$. For each segment, the
OBU calculates $f$ on input $(loc, time)$ to get a price $p$,
and computes a payment tuple that consists of a randomized
hash $h$ of the data structure $(loc, time)$, a homomorphic
commitment $c_p$ to its price, and a proof $\pi$ that
the committed price belongs to $\Upsilon$. The randomization of
the hash is needed in order to prevent dictionary attacks
that recover $(loc, time)$.
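To see why the hash must be randomized, note that the set of plausible $(loc, time)$ values is small enough to enumerate. The sketch below assumes, purely for illustration, that randomization is done by prepending a random salt before hashing with SHA-256; the text does not specify the mechanism, and the string encoding of segments is hypothetical.

```python
import hashlib
import secrets
from typing import Optional, Tuple


def randomized_hash(loc: str, time: str, salt: Optional[bytes] = None) -> Tuple[str, bytes]:
    """Hash (loc, time) together with a random salt; return (digest, salt)."""
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.sha256(salt + f"{loc}|{time}".encode()).hexdigest()
    return digest, salt


def is_preimage(digest: str, loc: str, time: str, salt: bytes) -> bool:
    """Check a revealed (loc, time, salt) against a previously sent digest."""
    return randomized_hash(loc, time, salt)[0] == digest


h, salt = randomized_hash("road-17", "2010-03-01T08:15")

# A dictionary attacker who enumerates candidate segments without the salt
# cannot match the digest, even when guessing the correct segment.
guess, _ = randomized_hash("road-17", "2010-03-01T08:15", salt=b"")
print(guess == h)

# When challenged, the OBU reveals the preimage (including the salt).
print(is_preimage(h, "road-17", "2010-03-01T08:15", salt))
```

Without the salt, the service provider could hash every candidate segment and compare the results against the submitted payment tuples.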

At the end of the tax period, the OBU and the TSP engage
in a two-party protocol. The OBU adds the fees of
all the segments to obtain a total fee $fee$. The OBU adds
all the openings $open_p$ to obtain an opening $open_{fee}$.
Next, the OBU composes a payment message $m$ that
consists of $(tag, fee, open_{fee})$ and all the payment tuples
$(h, c_p, \pi)$. The OBU signs $m$ and sends both the message
$m$ and its signature $s_m$ to the TSP. The TSP verifies
the signature and, for each payment tuple, verifies the
proof $\pi$. Then the TSP, by using the homomorphic property
of the commitment scheme, adds the commitments
$c_p$ of all the payment tuples to obtain a commitment $c_{fee}$
and checks that $(fee, open_{fee})$ is a valid opening for $c_{fee}$.

When the TC sends the TSP a proof $\phi$ that a car was
at some position at a given time, the TSP relays $\phi$ to the
OBU. The OBU first verifies that the request is signed
by the TC, and then it searches for a payment tuple
$(h, c_p, \pi)$ for which $\mu(\phi, (loc, time))$ outputs accept.
Here, $\mu : (\phi, (loc, time)) \to \{accept, reject\}$ is a function
established by the TSP that outputs accept when
the information in $\phi$ and in $(loc, time)$ are similar in
accordance with some metric, such as the one proposed
in [39]. Once the payment tuple is found, the OBU sends
the number of the tuple to the TSP together with the
preimage $(loc, time)$ of $h$ and the opening $(p, open_p)$
of $c_p$. The TSP checks that $(p, open_p)$ is the valid opening
of $c_p$, that $(loc, time)$ is the preimage of $h$, and that
$\mu(\phi, (loc, time))$ outputs accept.
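The search step on the OBU can be sketched as follows. The matching function here is a hypothetical stand-in, not the metric of [39]: it assumes segments and spot checks are reduced to a planar position in metres plus a timestamp in seconds, and accepts when both are within fixed tolerances. All thresholds are illustrative assumptions.

```python
import math
from typing import List, Optional, Tuple

# (x in metres, y in metres, t in seconds); a simplified stand-in for (loc, time).
Point = Tuple[float, float, float]


def mu(phi: Point, segment: Point, max_dist: float = 50.0, max_dt: float = 60.0) -> bool:
    """Hypothetical similarity test: accept if close in both space and time."""
    dist = math.hypot(phi[0] - segment[0], phi[1] - segment[1])
    return dist <= max_dist and abs(phi[2] - segment[2]) <= max_dt


def find_tuple(phi: Point, segments: List[Point]) -> Optional[int]:
    """Return the index of the first committed segment matching phi, if any."""
    for k, segment in enumerate(segments):
        if mu(phi, segment):
            return k
    return None


segments = [(0.0, 0.0, 0.0), (1000.0, 0.0, 120.0), (2000.0, 0.0, 240.0)]
print(find_tuple((1010.0, 5.0, 130.0), segments))  # spot check near the second segment
print(find_tuple((5000.0, 0.0, 130.0), segments))  # no committed segment matches
```

A `None` result corresponds to the case in which the OBU has no committed segment consistent with the spot check and therefore cannot answer the challenge.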

Intuitively, this protocol ensures the four security
properties enunciated in the previous section. Drivers
cannot shut down their OBUs, nor report false GPS data,
as they run the risk of not having committed to a segment
containing the $(loc, time)$ in the challenge $\phi$. We
note that after sending $(m, s_m)$ to the TSP, OBUs cannot
claim that they were at any position $(loc', time')$ different
from the ones used to compute the message $m$.
Similarly, OBUs cannot use incorrect road prices without
being detected, as the TSP can check whether the
correct price for a segment $(loc, time)$ was used once
the commitments are opened. The homomorphic property
ensures that the reported final fee is not arbitrary,
but the sum of all the committed subfees. Moreover,
by making the OBU prove that the committed prices
belong to the image of $f$, we prevent a malicious OBU
from decreasing the final fee by sending only one wrong
commitment to a negative price in the payment message,
which would give it an overwhelming probability of not
being detected by the spot checks. Additionally, the fact
that the OBU signs the payment message $m$ ensures that
no malicious TSP can frame an OBU by modifying the
received commitments, and that a malicious OBU cannot
plead innocent by invoking the possibility of being
framed by a malicious TSP. Similarly, the fact that
the TC signs the challenge $\phi$ prevents a malicious TSP
from sending fake proofs to the OBU, e.g. with the aim of
learning its location. Finally, the privacy of the drivers
is preserved, as the OBU does not need to disclose more
location information than that in the payment tuple that
matches the proof $\phi$ (already known to the TSP).


3.3 Efficient Instantiation: High Level Specification


We now outline at a high level our efficient instantiation
of Optimistic Payment. We employ the integer commitment
scheme due to Damgård and Fujisaki [15] and
the CL-RSA signature scheme proposed by Camenisch
and Lysyanskaya [9]. Both schemes use cryptographic
keys based on a special RSA modulus $n$ of length $\ell_n$.
A commitment $c_x$ to a value $x$ is computed as
$c_x = g_0^{x} g_1^{open_x} \pmod{n}$, where the opening $open_x$ is a
random number of length $\ell_n$, and the bases $(g_0, g_1)$
correspond to the commitment public parameters. Given a
public key $pk = (n, R, S, Z)$, a CL-RSA signature has
the form $(A, e, v)$, with lengths $\ell_n$, $\ell_e$, and $\ell_v$ respectively,
such that $Z \equiv A^{e} R^{x} S^{v} \pmod{n}$. To prove that a price
belongs to $\Upsilon$, we use a non-interactive proof of possession
of a CL-RSA signature on the price. We also employ
a collision-resistant hash function
$H : \{0,1\}^{*} \to \{0,1\}^{\ell_c}$.

Initialization. The pricing policy $f : (loc, time) \to \Upsilon$,
where each price $p \in \Upsilon$ has an associated valid CL-RSA
signature $(A, e, v)$ generated by the TSP, the cryptographic
key pair $(pk_{OBU}, sk_{OBU})$, the public key of the
TSP $(n, R, S, Z)$, the public key of the TC, and the public
parameters $(g_0, g_1)$ of the commitment scheme are stored
on the OBU. Similarly, the TSP possesses its own secret
key $sk_{TSP}$ and knows all the public keys in the system.

Protocol 1: Protocol between OBU and TSP during taxing phase


Tax period. Protocol 1 illustrates the calculations and
interactions between the OBU and the TSP under normal
functioning during the tax period. We denote the operations
carried out by the OBU as Pay(), and the operations
executed by the TSP as VerifyPayment(). While driving,
the OBU collects location data and slices it into segments
$(loc, time)$ according to the policy. For each of the $N$
collected segments, the OBU generates a payment tuple
$(h_k, c_{p_k}, \pi_k)$. This iterative step is broken down in
lines 1 to 21 in Protocol 1. The most resource-consuming
operation is the computation of $\pi_k$, which proves the
possession of a valid CL-RSA signature on the price $p_k$
(lines 9 to 20). The length of the random values used
in this step is specified in Appendix B.2. At the end of
the tax period the OBU generates and signs the payment
message $m$ including the tag $tag$, the total fee, the opening
$open_{fee}$, and all the payment tuples $(h_k, c_{p_k}, \pi_k)$
(lines 22 to 26). Finally it sends $(m, s_m)$ to the TSP.


Upon reception of a payment message, the TSP executes
the VerifyPayment() algorithm. First the TSP verifies
the signature $s_m$ using the OBU’s public key $pk_{OBU}$.
Next, it proceeds to the verification of the proof $\pi_k$
included in each of the $N$ payment tuples contained in $m$
(lines 13 to 22). In each iteration it performs a series of
modular exponentiations, and uses the intermediate results
to compute the hash $ch'$. Then, it checks whether
$ch'$ is the same as the value $ch$ contained in $\pi_k$. If this
verification, together with the two range proofs in lines
20 and 21, is successful, the TSP is convinced that all
the prices $p_k$ used by the OBU are indeed a valid image
of $f$. Finally, the TSP validates the commitments $c_{p_k}$ to
ensure that the aggregation of all subfees adds up to the
final fee (lines 24 to 26). For this, it calculates $c'_{fee}$ as the
product of all commitments $c_{p_k}$, and computes the
commitment $c_{fee}$ using the values $fee$ and $open_{fee}$ provided
by the OBU. If both values are the same, the TSP is
convinced that the final fee reported by the OBU adds up to
the sum of all subfees reported in the payment tuples.
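The final fee check in VerifyPayment() can be sketched end to end for $N$ payment tuples. This is a simplified illustration under toy parameters (the modulus and bases are assumptions and are not secure), and it omits the signature check and the proofs $\pi_k$ entirely; the function names `pay` and `verify_fee` are hypothetical.

```python
import secrets
from typing import List, Tuple

# Toy commitment parameters (illustrative assumptions, NOT secure).
N, G0, G1 = 3233, 4, 9


def commit(x: int, opening: int) -> int:
    """c = G0^x * G1^opening mod N."""
    return (pow(G0, x, N) * pow(G1, opening, N)) % N


def pay(prices: List[int]) -> Tuple[List[int], int, int]:
    """OBU side: commit to each subfee; report the total fee and summed opening."""
    openings = [secrets.randbelow(N) for _ in prices]
    commitments = [commit(p, r) for p, r in zip(prices, openings)]
    return commitments, sum(prices), sum(openings)


def verify_fee(commitments: List[int], fee: int, open_fee: int) -> bool:
    """TSP side: the product of the commitments must equal a commitment to (fee, open_fee)."""
    product = 1
    for c in commitments:
        product = (product * c) % N
    return product == commit(fee, open_fee)


commitments, fee, open_fee = pay([3, 5, 7])
print(verify_fee(commitments, fee, open_fee))      # honest report
print(verify_fee(commitments, fee - 1, open_fee))  # under-reported fee
```

The TSP never sees the individual subfees, only their commitments, yet it can still confirm that the reported total is consistent with them.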


Proof Challenge. We denote as OBUopen() and Check() 
the algorithms carried out by the OBU and the TSP, re- 
spectively, when the former is challenged with @. When 
running the OBUopen() algorithm, the OBU searches 
for the pre-image (loc;, tume;) of a hash hy containing 
the location and time satisfying @, and sends this infor- 
mation to the service provider along with the price pz 
and the opening openy, . 

Upon reception of this message, the TSP executes the
Check() algorithm. First, it verifies whether the segment
(loc_k, time_k) actually contains the location in φ. Then,
it computes the value h_k = H(loc_k, time_k) and checks
whether the OBU had committed to this value in one of
the payment tuples reported during the tax period. Lastly,
the TSP uses open_{p_k} to open the commitment c_{p_k} and
verifies whether p'_k = f(loc_k, time_k) equals the price
p_k reported by the OBU during the OBUopen() algorithm.
If all verifications succeed, the TSP is convinced
that the location data used by the OBU in the fee
calculation and the price assigned by the OBU to the
segment (loc_k, time_k) are correct.
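The three verifications performed by Check() can be sketched as follows. This is a simplified illustration under stated assumptions: SHA-256 stands in for the hash function H, the commitment uses toy parameters, a segment is a list of (latitude, longitude, time) points, and "contains the location in φ" is reduced to exact membership. None of the names below come from the paper.

```python
import hashlib

N, G, H = 1019 * 1031, 4, 9   # toy commitment parameters (assumed)

def commit(value, opening):
    """Commitment c = G^value * H^opening mod N."""
    return (pow(G, value, N) * pow(H, opening, N)) % N

def segment_hash(segment):
    """h_k = H(loc_k, time_k); SHA-256 is used here as a stand-in."""
    return hashlib.sha256(repr(segment).encode()).hexdigest()

def check(spot, segment, price, opening, payment_tuples, policy):
    """Return True only if all verifications succeed.

    spot:            (lat, lon, time) where the vehicle was observed
    payment_tuples:  dict mapping each reported h_k to its commitment c_{p_k}
    policy:          callable implementing f(segment) -> price
    """
    # 1. The revealed segment must contain the observed location and time.
    if spot not in segment:
        return False
    # 2. The OBU must have committed to this segment during the tax period.
    h_k = segment_hash(segment)
    if h_k not in payment_tuples:
        return False
    # 3. The commitment must open to the reported price ...
    if payment_tuples[h_k] != commit(price, opening):
        return False
    # ... and the reported price must be the one the policy assigns.
    return policy(segment) == price
```

A spot check thus reveals a single segment; all other payment tuples stay hidden behind their hashes and commitments.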


4 PrETP Evaluation 


In this section we evaluate the performance of PrETP.
We start by describing the test scenario and both our
OBU and TSP prototypes. Next, we analyze the performance
of the prototypes for different configuration parameters.
Finally, we study the communication overhead
in PrETP, and compare it to existing ETP systems.


4.1 Test Scenario 


Policy model. The first step in the implementation of
PrETP consists in specifying a policy model in the form
of the mapping function f : (loc, time) → Υ. We decide
to follow the same criteria as currently existing ETP
schemes [36], i.e., road prices are determined by two
parameters: type of road and time of the day. More
specifically, we define three categories of roads (‘highway’,
‘primary’, and ‘others’) and three time slots during the
day. For each of the possible nine combinations we assign
a price per kilometer p and we create a valid signature
(A, e, v) using the TSP’s secret key. We note that
the choice of this policy is arbitrary and that PrETP, as
well as OP, can accommodate other price strategies.
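A policy of this shape amounts to a nine-entry lookup table. The sketch below is only an illustration: the prices, the slot names, and the slot boundaries are hypothetical, since the paper does not list them, and the TSP signature (A, e, v) attached to each price is left out.

```python
from datetime import time

ROAD_TYPES = ("highway", "primary", "others")

# Hypothetical prices in cents per kilometer, one per (road type, time slot).
PRICE_TABLE = {
    ("highway", "morning"): 10, ("highway", "evening"): 8, ("highway", "night"): 4,
    ("primary", "morning"): 6,  ("primary", "evening"): 5, ("primary", "night"): 2,
    ("others", "morning"): 3,   ("others", "evening"): 2,  ("others", "night"): 1,
}

def time_slot(t: time) -> str:
    """Map a time of day to one of three slots (boundaries are assumed)."""
    if time(6, 0) <= t < time(14, 0):
        return "morning"
    if time(14, 0) <= t < time(22, 0):
        return "evening"
    return "night"

def price_per_km(road_type: str, t: time) -> int:
    """The mapping f for a one-kilometer segment of a single road type."""
    category = road_type if road_type in ROAD_TYPES else "others"
    return PRICE_TABLE[(category, time_slot(t))]

print(price_per_km("highway", time(8, 30)))
```

Keeping the image of f this small is what lets the OBU store one TSP signature per possible price.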


Location data. We provide the OBU with a set of location
data describing a real trajectory of a vehicle. These
data are obtained by driving with our prototype for one
hour in an urban area, covering a total distance of 24
kilometers. We note that such a dataset is sufficient to
validate the performance of PrETP, since results for
different driving scenarios (e.g., faster or slower) can
easily be extrapolated from the results presented in this
section.


Parameters of the instantiation. The performance of
OP depends directly on the length of the protocol
instantiation parameters, and in particular, on the size of
the cryptographic keys of the entities (l_n). In our
experiments we consider three case studies: medium security
(l_n = 1024 bits), high security (l_n = 1536 bits), and very
high security (l_n = 2048 bits). The value l_m is determined
by the length of the prices p, which in turn determines the
value of l_e. Therefore, both lengths are constant for all
security cases. The value of l_v varies depending on the
value of l_n. Finally, the remaining four length parameters,
l_h among them, are set to the output length of the chosen
hash function primitive (see Sect. 4.2). These lengths
determine the size of the random numbers generated in line
13 in Protocol 1 (see Appendix B for a detailed explanation).
Table 1 summarizes the parameter lengths considered for
each security level.


Table 1: Length of the parameters (in bits)

Parameter         l_n     l_e    l_v     l_m    l_h and remaining lengths
Normal Sec.       1024    128    1216    32     160
High Sec.         1536    128    1728    32     160
Very high Sec.    2048    128    2240    32     160


OBU Platform. In order to make our prototype as realistic
as possible, we implement PrETP using as a starting
point the embedded design described in [4], which performs
the conversion of raw GPS data into a final fee
internally. We extend and adapt this prototype with the
functionalities of OP to make it compatible with PrETP.

At a high level, the elements of our OBU prototype [4]
are: a processing unit, a GPS receiver, a GSM modem,
and an external memory module. As a benchmark we use
the Keil MCB2388 evaluation board [30], which contains
an NXP LPC2388 [34] 32-bit ARM7TDMI [2] microcontroller.
This microcontroller implements a RISC architecture,
runs at 72 MHz, and offers 512 Kbytes of on-chip program
memory and 98 Kbytes of internal SRAM. As external memory,
we use an off-the-shelf 1 GByte SD Card connected to the
microcontroller. Finally, we use the Telit GM862-GPS [43]
as both GPS receiver and GSM modem.

As our platform does not contain any cryptographic 
coprocessors, we implement all functionalities exclu- 
sively in software. Note that although we could easily 
add a hardware coprocessor (e.g., [35]) to the prototype 
in order to carry out the most expensive cryptographic 
computations, we choose the option that minimizes the
production costs of the OBU. Besides, this approach al-
lows us to identify the bottlenecks in the protocol im- 
plementation, leaving the door open to hardware-based 
improvements if needed. 

We have constructed a cryptographic library with the
primitives required by our instantiation of the OP
protocol, namely: i) a modular exponentiation technique,
ii) a one-way hash function, and iii) a random number
generator. For the first primitive we use the ACL [5]
library, a collection of arithmetic and modular routines
specially designed for ARM microcontrollers. As hash
function we choose RIPEMD-160 [22], with an output length
l_h of 160 bits. As our platform does not provide any
physical random number generator, we use the Salsa20 [6]
stream cipher in keystream mode as the third primitive. We
note that a commercial OBU should include a source of
true randomness.

In order to keep the OBU flexible and easily scalable,
we arrange data in different memory areas depending on
their lifespan. Long-term parameters (pk_OBU, sk_OBU,
pk_TSP, commitment parameters) are directly embedded
into the microcontroller’s program memory, while short-term
parameters (payment tuples, (loc, time) segments)
and updatable parameters (digital road map, policy f)
are stored separately on the SD Card. We note that our
library provides a byte-oriented interface with the SD
Card, resulting in a considerable overhead when
reading/writing values.


TSP Platform. We implement our TSP prototype on a 
commodity computer equipped with an Intel Core2 Duo 
E8400 processor at 3 GHz, and 4 Gbyte of RAM. We use 
C as programming language, and the GMP [25] library 
for large-integer cryptographic operations. 


4.2 Performance Evaluation 


OBU performance. The most time-consuming operations
carried out by the OBU during the taxing phase are
the Mapping() algorithm and the Pay() algorithm. The
Mapping() algorithm is executed every time a new GPS
string is available in the microcontroller. Its function is
to look up in the digital road map the type of road that
corresponds to the GPS coordinates. Once the vehicle has
driven a kilometer, the OBU maps the segment to the
appropriate price p_k as specified in the policy. At this
point, the Pay() algorithm is executed in order to create
the payment tuple. For each segment, the OBU generates:
i) a hash value h_k of the location data, ii) a commitment
c_{p_k} to the price p_k, and iii) a proof π_k proving that
the price p_k is genuinely signed by the TSP (and thus
belongs to the image of f). To protect users’ privacy we
also require that no sensitive data is stored on the SD
Card in plaintext form. For this purpose we use the AES
[33] block cipher in CCM mode [23] with a key length of
128 bits. We denote this operation as E_k. At the end of
the taxing phase, the OBU adds all the prices p_k mapped
to each segment to obtain the fee, and all the openings
open_{p_k} to obtain open_fee. Finally, the OBU constructs
and signs the payment message m and sends it to the TSP.

As it does not involve the key, the computing time of
the Mapping() algorithm is independent of the security
scenario. Further, the total time depends only on the
duration of the trip and is independent of the speed of
the vehicle: the Mapping() algorithm is always executed
3 600 times per hour, taking a total of 839.11 seconds in
our prototype. However, for each of the segments this time
can vary depending on the number of points that have to be
processed, i.e., depending on the speed of the vehicle. In
our experiments it requires 76.10 seconds for the longest
segment, i.e., the one where the vehicle spent the most
time to drive one kilometer and thus (loc_k, time_k)
contains the largest number of points.

Similarly, the execution time for h_k and E_k depends
exclusively on the length of the segments (loc_k, time_k),
as it is proportional to the number of GPS points in the
segments. The number of points per segment varies not
only with the average speed of the car but also with the
length of the segments defined in the pricing policy. In
our experiments, computing h_k and E_k takes 0.08
seconds and 0.43 seconds, respectively, for the shortest
and the longest segments. For the Mapping() algorithm
and both the h_k and E_k operations, more than 90% of the
time is spent in the communication with the SD card.

On the other hand, the execution time for c_{p_k} and π_k
is constant for all segments, as it does not depend on the
length of a particular slice (see lines 6 to 20 in
Protocol 1). In order to calculate c_{p_k}, the OBU needs to
generate a random opening open_{p_k} and perform two
modular exponentiations and a modular multiplication. The
computation of π_k involves the generation of ten random
numbers and a hash value, and the execution of fourteen
modular exponentiations, nine modular multiplications,
eight additions, and eight multiplications. The bottleneck
of both operations is determined by the modular
operations. Although we could take advantage of fixed-base
modular exponentiation techniques, we choose to use
multi-exponentiation algorithms [18], which have lower
storage requirements. Multi-exponentiation based
algorithms, which compute values of the form a^b c^d (mod n)
in one step, allow us to considerably speed up the process.
The average execution times for computing c_{p_k} are
0.76 seconds, 2.25 seconds, and 5.69 seconds for medium,
high, and very high security, respectively. For π_k, these
times are 6.20 seconds, 19.45 seconds, and 41.64 seconds,
respectively.
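To make the idea of computing a^b c^d (mod n) in one step concrete, the following sketch implements simultaneous exponentiation with a shared square-and-multiply loop, often called Shamir's trick. It is one well-known multi-exponentiation method and is shown here as an assumption; the paper does not state which algorithm from [18] the prototype uses.

```python
def multi_exp(a: int, b: int, c: int, d: int, n: int) -> int:
    """Compute (a**b * c**d) % n with a single pass over the exponent bits.

    Both exponents share the squarings, so the cost is close to that of one
    exponentiation instead of two. Only one extra value (a*c) is stored.
    """
    ac = (a * c) % n                      # used when both exponent bits are 1
    result = 1
    top = max(b.bit_length(), d.bit_length())
    for i in range(top - 1, -1, -1):      # most significant bit first
        result = (result * result) % n    # one shared squaring per bit
        bit_b = (b >> i) & 1
        bit_d = (d >> i) & 1
        if bit_b and bit_d:
            result = (result * ac) % n
        elif bit_b:
            result = (result * a) % n
        elif bit_d:
            result = (result * c) % n
    return result % n

n = 1019 * 1031                           # toy modulus for the example
print(multi_exp(4, 123456, 9, 654321, n) ==
      (pow(4, 123456, n) * pow(9, 654321, n)) % n)
```

The storage advantage mentioned above shows up here as a single precomputed product, whereas fixed-base techniques keep a table of powers for every base.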

Table 2 summarizes the timings for all OBU operations
and routines for a journey of one hour. We note
that, even when 2048-bit RSA keys are used, the OBU
can perform all operations needed to create the payment
tuples in real time. While the trip lasted one hour, the
Mapping() and Pay() algorithms only required 1 982.41
seconds. The computation time is dominated by the
Pay() algorithm, which depends on the number of GPS
strings in each segment (loc, time). This number varies
with the speed of the vehicle and the pricing policy. If
a vehicle is driving at a constant speed, policies that
establish prices for small distances result in segments
containing fewer GPS points than policies that consider
long distances. Similarly, given a policy fixing the size
of the segments, driving faster produces segments with
fewer points than driving slower. In both cases, π_k has to
be computed fewer times and the Pay() algorithm runs
faster. Thus, the policy can be used as a tuning parameter
to guarantee the real-time operation of the OBU.

Table 2: Execution times (in seconds) for an hour journey of 24 km, for all possible security scenarios.

                  Medium Security
Algorithm         Segment     Full trip
Mapping()         76.10 s     839.11 s
Pay()             7.88 s      183.91 s
  h_k             0.08 s      1.08 s
  E_k             0.43 s      6.35 s
  c_{p_k}         0.76 s      18.19 s
  π_k             6.20 s      158.09 s


Using the values in Table 2, for each of the levels
of security we can calculate the time our OBU is idle,
in our case (3600 − 839.11) seconds, with 839.11
seconds being the time required by the Mapping()
algorithm. Then, considering our current policy, we can
estimate the number of times the Pay() algorithm could
be executed, which in turn represents the number of
kilometers that could have been driven by a car in one
hour, i.e., the average speed of the car. For normal
security, our OBU could operate in real time even if a
vehicle were driving at 350 km/h. This speed decreases to
124 km/h when 1536-bit keys are used, and to 57 km/h if
the keys have length 2048 bits. Only when using high
security parameters would our OBU have problems operating
in the field. However, as mentioned before, including
a cryptographic coprocessor in the platform would suffice
to solve this problem whenever high security is required.
Moreover, in our tests we consider a worst-case
scenario in which all GPS strings are processed upon
reception. In fact, processing fewer strings would suffice
to determine the location of the vehicle. As the execution
time required by the Mapping() algorithm would
decrease linearly, OBUs would be able to support higher
vehicle speeds.
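The speed estimate follows from the per-segment Pay() times in Table 2. The short calculation below reproduces it, assuming one-kilometer segments and rounding down to whole executions; the function and constant names are illustrative.

```python
HOUR_SECONDS = 3600.0
MAPPING_SECONDS_PER_HOUR = 839.11   # Mapping() total for a one-hour trip

# Average Pay() time per one-kilometer segment, taken from Table 2.
PAY_SECONDS_PER_SEGMENT = {"medium": 7.88, "high": 22.13, "very high": 47.79}

def max_segments_per_hour(pay_seconds: float) -> int:
    """Whole Pay() executions that fit into the time left over by Mapping().

    With one-kilometer segments this equals the maximum average speed
    in km/h at which the OBU still keeps up in real time.
    """
    idle = HOUR_SECONDS - MAPPING_SECONDS_PER_HOUR
    return int(idle // pay_seconds)

for level, seconds in PAY_SECONDS_PER_SEGMENT.items():
    print(level, max_segments_per_hour(seconds), "km/h")
```

Running it gives 350, 124, and 57 km/h for the three security levels, matching the figures quoted above.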


Table 2 (continued): High Security and Very high Security.

                  High Security             Very high Security
Algorithm         Segment     Full trip     Segment     Full trip
Mapping()         76.10 s     839.11 s      76.10 s     839.11 s
Pay()             22.13 s     528.47 s      47.79 s     1 143.30 s
  h_k             0.08 s      1.08 s        0.08 s      1.08 s
  E_k             0.43 s      6.35 s        0.43 s      6.35 s
  c_{p_k}         2.25 s      54.08 s       5.69 s      136.82 s
  π_k             19.45 s     466.96 s      41.64 s     999.05 s

In the OBUopen() algorithm, only executed upon request
from the TC, the OBU searches its memory for a segment
(loc, time) in accordance with the proof sent by the
TSP. Here, the time accuracy provided by the GPS system
is used to ensure synchronization between the data
in φ and the segment (loc, time). The main bottleneck
of this operation is the decryption of the location data
corresponding to the correct segment. On average, our
prototype can decrypt such a segment in 0.27 seconds.


TSP performance. The most time-consuming task the TSP
must perform corresponds to the VerifyPayment()
algorithm, which has to be executed each time the TSP
receives a payment message. This algorithm involves three
operations: the verification of the proof π_k for each
segment, the multiplication of all commitments c_{p_k} to
obtain c_fee, and the opening of c_fee in order to check
whether it corresponds to the reported final fee. The most
costly operation is the verification of π_k, in particular
the calculation of the four intermediate t values, which
requires a total of eleven modular exponentiations (lines
14 to 22 in Protocol 1).


Table 3 (left) shows the performance of the
VerifyPayment() algorithm for each of the considered
security levels when segments have a length of one
kilometer. We also provide an estimation of the time
required to process all the proofs sent by an OBU during a
month, assuming that a vehicle drives an average of
18 000 km per year (1 500 km per month).


These results allow us to extrapolate the number of
OBUs that can be supported by a single TSP in each
security scenario for different segment lengths. Intuitively,
the capacity of the TSP increases when segments are larger,
as the payment messages contain fewer proofs π_k. The
number of OBUs supported by a single TSP is presented
in Table 3 (right). For a segment length of 1 km, the
TSP is able to support 164 000, 58 000, and 29 000
vehicles depending on the chosen security level. Even when
l_n is 2048 bits, only 36 servers are needed to accommodate
one million OBUs. This number can be reduced
by parallelizing tasks at the server side, or by using fast
cryptographic hardware for the modular exponentiations.
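The capacity figures can be approximated from the per-segment verification times in Table 3 (left). The sketch below assumes a 30-day month, a TSP that verifies proofs sequentially and without pause, and a cost that scales linearly with the number of proofs; these assumptions are ours and are not stated in the text, although they reproduce the reported values once truncated to thousands.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumed 30-day month
KM_PER_MONTH = 1500                  # average distance per vehicle

# VerifyPayment() time per segment proof, from Table 3 (left).
VERIFY_SECONDS = {"medium": 0.0105, "high": 0.0295, "very high": 0.0587}

def obus_per_tsp(level: str, segment_km: float = 1.0) -> float:
    """Estimated number of OBUs one TSP can verify within a month."""
    proofs_per_obu = KM_PER_MONTH / segment_km
    seconds_per_obu = proofs_per_obu * VERIFY_SECONDS[level]
    return SECONDS_PER_MONTH / seconds_per_obu

for km in (0.5, 0.75, 1, 2, 3):
    row = [int(obus_per_tsp(level, km) // 1000) * 1000 for level in VERIFY_SECONDS]
    print(km, "km:", row)
```

Doubling the segment length halves the number of proofs per vehicle and therefore doubles the capacity of the TSP.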
Table 3: Timings (in seconds) for the execution of VerifyPayment() in TSP (left). Number of OBUs supported by a single TSP (right).

VerifyPayment()     Segment      One Month
Medium Sec.         0.0105 s     15.750 s
High Sec.           0.0295 s     44.250 s
Very high Sec.      0.0587 s     88.050 s


4.3 Communication overhead 


We now compare the communication overhead of
PrETP with respect to straightforward ETP implementations
and VPriv [39]. Both in straightforward ETP
implementations and in VPriv the OBU sends all GPS
strings to the TSP. Let us consider that vehicles drive
1500 km per month at an average speed of 80 km/h.
Then, transmitting the full GPS information to the
TSP requires 2.05 Mbyte (considering a shortened GPS
string of 32 bytes containing only latitude, longitude,
date and time). VPriv requires more bandwidth than
straightforward ETP systems, as extra communications
are necessary to carry out the interactive verification
protocol (see Sect. 6). Using PrETP, the communication
overhead comes from the payment tuples that must be
sent along with the fee. For each segment, the OBU
sends the payment tuple (h_k, c_{p_k}, π_k) to the TSP. When
sent uncompressed, this implies an overhead of
approximately 1.5 Kbyte per segment, i.e., less than 2 Mbyte
per month, for medium security (l_n = 1024 bits).
Additionally, less than 50 Kbyte have to be sent occasionally
to respond to a verification challenge after a vehicle has
been seen at a spot check. We believe this overhead is not
excessive for the additional security and privacy properties
offered by PrETP.
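The baseline figure for transmitting raw GPS data can be checked with a short calculation. The sketch assumes one GPS string per second, which is consistent with the 3 600 Mapping() executions per hour reported in Sect. 4.2, and reads Mbyte as 2^20 bytes; both are our assumptions.

```python
def gps_upload_bytes(km_per_month: float = 1500,
                     avg_speed_kmh: float = 80,
                     strings_per_second: float = 1,
                     bytes_per_string: int = 32) -> float:
    """Bytes needed per month to send every shortened GPS string to the TSP."""
    seconds_driving = km_per_month / avg_speed_kmh * 3600
    return seconds_driving * strings_per_second * bytes_per_string

total = gps_upload_bytes()
print(total, "bytes, about", round(total / 2**20, 2), "Mbyte")
```

Under these assumptions the result is 2 160 000 bytes, i.e., roughly the 2.05 Mbyte quoted above.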

The communication overhead in PrETP is dominated
by the payment message m sent by the OBU to the TSP,
the length of which depends on the number of segments
covered by the driver. Therefore, the segment length
can be seen as a parameter of the system that tunes the
tradeoff between privacy and communication overhead.
The smaller the segments, the larger the communication
overhead, because more tuples (h_k, c_{p_k}, π_k) need to be
sent. Allowing larger segments reduces the communication
cost but also reduces privacy because the OBU must
disclose a bigger segment when responding to a
verification challenge.

Further, the communication overhead can be almost
eliminated by having the OBU send only the hash of
the payment message at the end of each tax period and
leaving the verification of correct operation subject to
random checks. Following the spirit of the random “spot
checks” used for checking the input and prices, an OBU
could occasionally be challenged to prove its correct
functioning by sending the payment message corresponding
to the preimage of the hash sent at the end of a random
tax period.

Table 3 (right): Number of OBUs supported by a single TSP.

Segment size     Medium Sec.     High Sec.     Very high Sec.
0.5 km           82 000          29 000        14 000
0.75 km          123 000         43 000        22 000
1 km             164 000         58 000        29 000
2 km             329 000         117 000       58 000
3 km             493 000         175 000       88 000


5 Discussion 


Practical issues. Our OP scheme allows the OBU to
prove its correct operation to the TSP while revealing a
minimum amount of information. Nevertheless, we note
that fee calculation is not flexible. The reason is that the
OBU should store signatures created by the TSP on all
the prices that belong to Im(f), and thus, for the sake of
efficiency, we need to keep Im(f) small. For this purpose,
in our evaluation f is only defined for trajectory
segments of a fixed length (one kilometer) and of a fixed
road type. There are two obvious cases in which this
feature is problematic: when a vehicle has driven a
non-integer number of kilometers, and when one of the
segments contains pieces of roads with different cost (e.g.,
when a driver leaves the highway entering a secondary
road). In both cases the OBU cannot produce a payment
tuple because it does not have the signature by the TSP
on the price of the segment.

There are two possible solutions to these issues. A first
option would be to solve them at contractual level. The
policy designed by the TSP could include clauses that
indicate how to proceed when these conflicts arise. For
instance, in the first case the TSP could dictate that the
driver must pay for the whole kilometer, and in the
second case the policy could be that the price corresponds
to the cheapest of the roads, or to the most expensive. We
note that these decisions do not conflict with the general
purpose of the system, congestion control, as in all cases,
on average, drivers will pay proportionally to their use of
the roads. The second option would be to change the
way the OBU proves that the committed prices belong
to Im(f). In the construction proposed in Sect. 3, the
OBU employs a set membership proof, based on proving
signature possession, to prove that the committed
prices belong to the finite set Im(f). Alternatively, we
can define Im(f) as a range of (positive) prices, and let
the OBU use a range proof to prove that the committed
prices belong to Im(f). Since Im(f) is now much bigger,
f can be defined for segments of arbitrary length that
include several types of road. We outline a construction
that employs range proofs in the extended version of this
work [3].

Another issue is that our OP scheme does not offer
protection against OBUs that do not reply upon receiving
a verification challenge. In this case, the TSP should
be able to demonstrate to the TC that the OBU is
misbehaving. To permit this, the TSP can delegate to the TC
the verification of the “spot-check”, i.e., the TSP sends
the payment message m and the signature s_m to the TC,
and the TC interacts with the OBU (electronically, or by
contacting the driver through some other means) to
verify that m is valid.

Although in Sect. 2 we mentioned that the cost associated
with roads could depend on attributes of the driver
(e.g., retired users may get discounts) or on attributes of
the car (e.g., ecological cars may have reduced fees), the
pricing policy used by our prototype is rather inflexible.
We note that this is a limitation of our prototype and that
PrETP can support more flexible policies. For instance,
the TSP can apply discounts to the total fee reported by
the OBU, without the knowledge of fine grained location
data. Further, the system model in this work considers
only one service provider. However, the European
legislation [13, 20] points out that several SPs may provide
services in a given Toll Charger domain. PrETP can be
trivially extended to this setting.


Production cost. Our OBU prototype, constructed with
off-the-shelf components, demonstrates that a system
like PrETP can be built at a reasonable cost¹. Although
the security of our Optimistic Payment scheme does not
rely on any countermeasure against physical attacks by
drivers, for liability reasons it is desirable to use OBUs
with a certain level of tamper resistance. Nevertheless,
we note that on-board units in the market [36, 42]
already rely on tamper resistance. Further, secure remote
firmware updates are also required in privacy invasive
designs, and additional updates in PrETP containing new
maps and policies can be considered occasional.

¹ The cost of our prototype amounts to $500; such a number would
be drastically reduced in a mass-production scenario.


Privacy. Although we protect the privacy of the users
by keeping the location data in the client domain and
exploiting the hiding property of cryptographic
commitments, there exist a few sources of information
available to the TSP. First, as in many other services, users
in PrETP must subscribe to the service by revealing their
identity, and most likely their home address, to the TSP.
Second, the final fee and all the commitments (which
indicate the number of kilometers driven) must be sent to
the TSP at the end of each tax period. Decoding
techniques (e.g., [16]) using these data could be employed
by the TSP to infer the trajectories followed by a
vehicle by inspecting the possible combinations of prices
per kilometer that could have generated the total fee.
A possible solution to this problem consists in giving
users the possibility to send data associated with dummy
segments. For this, a price p of zero should be included in
the pricing policy so that it does not imply any cost for
the drivers when aggregating the homomorphic
commitments, and so that the proofs π_k are still accepted by
the TSP. The downside of this approach is that it introduces
an overhead in both the processing of the OBU and the
communication link with the TSP. Apart from this,
subliminal channels in the communication or the encryption
schemes must be avoided, e.g., by providing a true physical
randomness source in the OBU (see [44] for further
discussion on the topic).


Legal Compliance. We build on the analysis presented
in [44] and discuss the compliance of PrETP
with European Legislation. With regard to data
processing, the data controller (Art. 6.2 in [13]) has to
abide by principles found in the Data Protection
Directive 95/46/EC [21] (DPD) in Art. 6.1, 16 and 17. We use
these principles to assess compliance of the proposed
architecture since these principles have been further
specified in the other provisions of the DPD. We only look
at the principles of direct interest for this paper, which
are that i) the data must be adequate, relevant and not
excessive, ii) the data must be kept accurate and up to
date, iii) the data should be processed in a secure and
confidential manner, and iv) data should not be kept longer
than necessary. Firstly, data must be kept accurate and
up-to-date (Art. 6.1(d) in [21]). In PrETP the OBU commits
to location data and to its price when reporting the final
fee. These commitments do not reveal any details on the
location or the price calculation. Given that the controller
is only allowed to process data that are adequate, relevant
and not excessive for the provision of the service
(Art. 6.1(c) in [21]), this seems a good solution to the
problem. The TC and the TSP should know that the
information given by the user is correct, but the information
that the commitment covers is not needed for PrETP [28, 38].
The commitments implemented in PrETP are designed to
guarantee that the OBU sends out the correct data without
putting all the user’s data in the hands of the TSP
or the TC. The TC might want to execute checks at
certain points in time to verify the veracity of these
commitments and sends “spot-checks” to the TSP, which
interacts with the OBU for the sake of verification. Only
at those times will more data be disclosed, because then
it is required to know the information the commitment
is based on to know whether the commitment is reliable.
Data used for verification will however only be
kept when an infringement is found. If there is no
infringement, the data will not be kept, in accordance with
data protection principles (Art. 6.1(e) in [28, 38]).
Secondly, the processing must be secure and confidential as
stated by Art. 16-17 in [21]. A positive step of PrETP in
this regard is keeping all the data inside the OBU and the
applied algorithms to protect these data [28, 38]. The
algorithms presented in this work are designed to reconcile
the conflicting interest of the users and the TSP, while
protecting the user from excessive data processing (note
that the data set in road tolling could be potentially quite
comprehensive; see Art. 7, Annex VI in [13]). This
criterion may be the most important in a road tolling setting.


6 Related work 


A privacy-friendly architecture for ETP in which location
data is not revealed to the service provider was
presented in [44], and its viability was shown in [4].
However, the design by [44] does not take into account that
the TSP and the TC need to check the correctness of the
operations carried out in the on-board unit, jeopardizing
its applicability to real world scenarios.

Another line of research has focused on the design of
secure multi-party protocols between the TSP and the
OBUs that allow TSPs to compute the total fee and
detect malicious OBUs while protecting location privacy.
Solutions proposed in [8, 7, 40] resort to general
reductions for secure multi-party computation and are very
inefficient. A more efficient protocol, VPriv, was proposed
in [39]. The basic idea consists in sending the location
data generated by a driver sliced into segments to the
TSP, in such a way that it remains hidden among
segments from multiple drivers. Then the TSP calculates
the subfees (fees of small time periods that add to the
final fee) of all segments and returns them to all OBUs.
Each OBU uses this information to compute its total fee
and, without disclosing any location data, proves to the
TSP that the total fee is computed correctly, i.e., by only
using the subfees that correspond to the location data
input by this particular OBU. Moreover, in order to
prevent malicious users from spoofing the GPS signal to
simulate cheaper trips, VPriv has an out-of-band
enforcement mechanism. This mechanism is based on the use of
random spot checks that demonstrate that a vehicle has
been at a location at a time (e.g., a photograph taken by
a road-side radar). Given this proof, the TSP challenges
the OBU to prove that its fee calculation includes the
location where the vehicle was spotted.

The protocol proposed in [39] has several practical
drawbacks. First, it requires vehicles to send anonymous
messages to the server (e.g., by using Tor [19]), imposing
high additional costs to the system. Second, their
protocol only avoids leaking any additional information
beyond what can be deduced from the anonymized
database. As the database contains path segments, the
TSP could use tracking algorithms to recover paths
followed by the drivers [29, 27, 32] and infer further
information about them. Third, the scalability of the system
is limited by the complexity of the protocol on the client
side, as it depends on the number of drivers in the system.
Practical implementations require simplifications such as
partitioning the set of vehicles into smaller groups, thus
reducing the anonymity set of the drivers. Fourth, VPriv
only uses spot checks to verify correctness of the
location, and thus needs an extra protocol to verify the
correct pricing of segments. This extra protocol produces an
overhead both in terms of computation and communication
complexity.

Our solution, similar to PriPAYD [44], does not re- 
quire messages between the OBU and the TSP to be 
anonymous as the computation of the fee is made locally 
and no personal data is sent to the provider. Thus, no 
database of personal data is created and we do not need 
to rely on database anonymization techniques to ensure 
users’ privacy. Further, the OBU’s operations depend 
only on the data it collects, independently of the number 
of vehicles in the system. Finally, our protocol can be 
integrated into a stand-alone OBU without the need of 
external devices to carry out the cryptographic protocols. 

To the best of our knowledge, the only protocol that
so far employs spot checks to verify both correctness of
the location and of the fee calculation is due to Jonge and
Jacobs [17]. In this solution, OBUs commit to segments
of location data and their corresponding subfees when
reporting the total fee to the TSP. They employ hash
functions as commitments. Upon being challenged to ratify
the information in the spot check, OBUs must provide
the hash pre-image of the corresponding segment, and
demonstrate that indeed the location was used to
compute the final fee.

Jonge and Jacobs’ protocol is limited by the fact that
using hash-based commitments one cannot prove that the
commitments to the subfees add to the total fee. As a
solution, they propose that the OBU also commits to the
subfees corresponding to bigger time intervals following
a tree structure. Each tax period is divided into months,
each month is divided into weeks, and so forth, and
subfees for each month, week, day, etc. are calculated and
committed. Then, instead of asking the OBU to open
only one commitment containing the instant specified in
TC’s proof, the TSP asks the OBU to open all the
commitments in the tree that include that instant. This
indeed proves that the sum is correct, at the cost of
revealing much more information to the TSP.

PrETP avoids this information leakage. The reason is
that, in our OP scheme, commitments are homomorphic
and thus allow TSP to check that the commitments to the
subfees add to the total fee without additional data. The
use of homomorphic commitments was also proposed
and briefly sketched in [17]. However, their scheme does
not prevent the OBU from committing to a “negative”
price, which would give a malicious OBU the possibility
of reducing the final fee by sending only one wrong
commitment, and thus with an overwhelming probability of
not being detected by the spot checks.
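
A short worked sketch of why the homomorphic check goes through, and why on its own it does not rule out a negative subfee. It assumes commitments of the form $c_k = g_0^{p_k} g_1^{open_k} \bmod n$, as in the instantiation of Appendix B.2; the sample prices are purely illustrative.

$$\prod_{k=1}^{N} c_k \;\equiv\; g_0^{\sum_{k} p_k}\, g_1^{\sum_{k} open_k} \;\equiv\; g_0^{fee}\, g_1^{open_{fee}} \pmod{n}$$

TSP therefore only needs $fee$ and $open_{fee} = \sum_k open_k$ to check the total, and learns nothing about the individual $p_k$. Now suppose an OBU drove three segments priced 5, 3 and 4, so that it owes 12. If nothing constrained the committed values, it could commit to $(5, 3, 4, -10)$ and report $fee = 2$: the product of the four commitments still opens to 2, and a spot check exposes the fraud only if it lands on the single bogus commitment. The construction in Appendix B closes this gap by requiring, for every commitment, a proof of possession of a TSP signature on the committed price.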


7 Conclusion 


The revelation of location data in Electronic Toll Pricing 
(ETP) systems, besides conflicting with the users’ right 
to privacy, can also pose inconveniences and extra invest- 
ments to service providers as the law demands that per- 
sonal data is stored and processed under strong security 
guarantees [21]. Furthermore, it has been shown [31] 
that security and privacy concerns are among the main 
reasons that discourage the use of electronic communication
services. Recent research [45] demonstrates that
users confronted with a prominent display of private
information not only prefer service providers that offer
better privacy guarantees but are also willing to pay higher
prices to use more privacy-protective systems.
Consequently, it is of interest for service providers to deploy
systems where the amount of location information that 
users need to disclose is minimized. 

As ETP systems are becoming increasingly impor- 
tant [13, 1], it is a challenge to implement them respect- 
ing both the users’ privacy and the interest of the service 
provider. Previous work relied on overly expensive
solutions, or on unrealistic requirements, to fulfill both
properties. In this work we have presented PrETP, an ETP
system that allows on-board units to prove that they
operate correctly while leaking the minimum amount of
information. Namely, upon request of the service provider,
on-board units can attest that the input location data for the
calculation of the fee is authentic and has not been tam- 
pered with. For this purpose we proposed a new cryp- 
tographic protocol, Optimistic Payment, that we define, 
construct and prove secure under standard assumptions. 
For this protocol, we also provide an efficient instantia- 
tion based on known secure cryptographic primitives. 

We have performed a holistic analysis of PrETP. Be- 
sides the security analysis, we have built an on-board 
unit prototype on an embedded platform, as well as a ser- 
vice provider prototype on a commodity computer, and 
we have thoroughly tested the performance of both using 
real world collected data. The result of our experiments 
confirms that our protocol can be executed in real time 
in an on-board unit constructed with off-the-shelf com- 
ponents. Finally, we have analyzed the legal compliance 
of PrETP under the European Law framework and
conclude that it fully supports the Data Protection Directive
principles. 
A Security Definition of Optimistic Payment


Ideal-world/real-world paradigm. We use the
ideal-world/real-world paradigm to prove our construction
secure. In this paradigm, parties are modeled as
probabilistic polynomial time interactive Turing machines. A
protocol $\psi$ is secure if there exists no environment $\mathcal{Z}$ that
can distinguish whether it is interacting with adversary $\mathcal{A}$
and parties running protocol $\psi$ or with the ideal process
for carrying out the desired task, where ideal adversary
$\mathcal{S}$ and dummy parties interact with an ideal
functionality $\mathcal{F}_{\psi}$. More formally, we say that protocol $\psi$ emulates
the ideal process if, for any adversary $\mathcal{A}$, there exists a
simulator $\mathcal{S}$ such that for all environments $\mathcal{Z}$, the
ensembles $\mathrm{IDEAL}_{\mathcal{F}_{\psi},\mathcal{S},\mathcal{Z}}$ and $\mathrm{REAL}_{\psi,\mathcal{A},\mathcal{Z}}$ are
computationally indistinguishable. We refer to [11] for a description
of these ensembles.

Our construction operates in the $\mathcal{F}_{REG}$-hybrid model,
where parties register their public keys at a trusted
registration entity and obtain from it a common reference
string. Below we depict the ideal functionality $\mathcal{F}_{REG}$,
which is parameterized with a set of participants $\mathbb{P}$ that is
restricted to contain OBU, TSP and TC only. We also
describe an ideal functionality $\mathcal{F}_{OP}$ for Optimistic
Payment. Every functionality and every protocol invocation
should be instantiated with a unique session-ID that
distinguishes it from other instantiations. For the sake of
ease of notation, we omit session-IDs from our
description.


Functionality $\mathcal{F}_{REG}$. Parameterized with a set of parties
$\mathbb{P}$, $\mathcal{F}_{REG}$ works as follows:

- On input (crs) from party $P$, if $P \notin \mathbb{P}$ it aborts.
Otherwise, if there is no value $r$ recorded, it picks $r \leftarrow D$
and records $r$. It sends (crs, $r$) to $P$.

- Upon receiving (register, $v$) from party $P \in \mathbb{P}$, it
records the value $(P, v)$.

- Upon receiving (retrieve, $P$) from party $P' \in \mathbb{P}$, if
$(P, v)$ is recorded then return (retrieve, $P$, $v$) to $P'$.
Otherwise send (retrieve, $P$, $\bot$) to $P'$.
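
The three interfaces of the registration functionality can be sketched as a small Python class. This is only an illustration: the class and method names are hypothetical, the distribution $D$ is assumed to be uniform random bits, `None` stands in for $\bot$, and aborting is modeled by raising an exception.

```python
import secrets
from typing import Dict, Hashable, Optional, Set, Tuple


class FReg:
    """Toy model of the registration functionality (hypothetical names)."""

    def __init__(self, participants: Set[str], crs_bits: int = 128) -> None:
        self.participants = set(participants)   # the set of allowed parties
        self.crs_bits = crs_bits                 # assumption: D = uniform bits
        self.crs: Optional[int] = None           # the recorded value r, if any
        self.registry: Dict[str, Hashable] = {}  # recorded (P, v) pairs

    def _require_member(self, party: str) -> None:
        if party not in self.participants:
            raise PermissionError(f"abort: {party} is not a participant")

    def on_crs(self, party: str) -> Tuple[str, int]:
        """(crs): sample r once, then hand the same r to every caller."""
        self._require_member(party)
        if self.crs is None:
            self.crs = secrets.randbits(self.crs_bits)
        return ("crs", self.crs)

    def on_register(self, party: str, value: Hashable) -> None:
        """(register, v): record (P, v) for a participating party."""
        self._require_member(party)
        self.registry[party] = value

    def on_retrieve(self, requester: str, party: str) -> Tuple[str, str, Optional[Hashable]]:
        """(retrieve, P): return the recorded v, or None (bottom) if absent."""
        self._require_member(requester)
        return ("retrieve", party, self.registry.get(party))
```

For instance, after `reg = FReg({"OBU", "TSP", "TC"})` and `reg.on_register("TSP", "pk_TSP")`, a call to `reg.on_retrieve("OBU", "TSP")` returns the registered key, while retrieving an unregistered party yields `None`.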


Functionality $\mathcal{F}_{OP}$. Running with OBU, TSP and TC,
$\mathcal{F}_{OP}$ works as follows:

- On input a message (initialize, $f$, $\mu$) from TSP, where
$f$ is a mapping $f : (loc, time) \rightarrow \Upsilon$ and
$\mu : (\phi, (loc, time)) \rightarrow \{accept, reject\}$, $\mathcal{F}_{OP}$ stores
$(f, \mu)$ and sends (initialize, $f$, $\mu$) to OBU.

- On input a message (payment, $tag$, $fee$,
$(k, (loc_k, time_k), p_k)_{k=1}^{N}$) from OBU, where $tag$
identifies the tax period, $\mathcal{F}_{OP}$ checks that a message
(payment, $tag$, ...) was not received before, that
for $k = 1$ to $N$, $p_k \in \Upsilon$, and that
$fee = \sum_{k=1}^{N} p_k$. If these checks succeed, $\mathcal{F}_{OP}$ sends
(payment, $tag$, $fee$, $N$) to TSP and stores the
tuple ($tag$, $fee$, $(k, (loc_k, time_k), p_k)_{k=1}^{N}$). Otherwise
$\mathcal{F}_{OP}$ sends (payment, $tag$, $\bot$) and stores ($tag$, $\bot$).

- On input a message (proof, $tag$, $\phi$) from TC, $\mathcal{F}_{OP}$
stores ($tag$, $\phi$) and sends (proof, $tag$, $\phi$) to TSP.

- On input a message (verify, $tag$, $\phi$) from TSP, $\mathcal{F}_{OP}$
checks that it stores messages (payment, $tag$, ...)
and (proof, $tag$, $\phi$). If it is the case, $\mathcal{F}_{OP}$
sends (verifyreq, $tag$, $\phi$) to OBU. Upon receiving
(verifyresp, $tag$, $(\sigma, (loc'_\sigma, time'_\sigma), p'_\sigma)$), $\mathcal{F}_{OP}$
checks whether the stored payment tuple
$(k, (loc_k, time_k), p_k)$ equals $(\sigma, (loc'_\sigma, time'_\sigma), p'_\sigma)$ for
$k = \sigma$, whether $\mu(\phi, (loc'_\sigma, time'_\sigma))$ outputs accept, and
whether $p'_\sigma = f(loc'_\sigma, time'_\sigma)$. If these checks
are correct, $\mathcal{F}_{OP}$ sends (verifyresul, not guilty,
$(\sigma, (loc'_\sigma, time'_\sigma), p'_\sigma)$) to TSP. Otherwise it sends
(verifyresul, guilty, $(\sigma, (loc'_\sigma, time'_\sigma), p'_\sigma)$).

- On input a message (blame, $tag$) from TSP, $\mathcal{F}_{OP}$
checks that messages (payment, $tag$, ...), (proof,
$tag$, $\phi$) and (verifyresp, $tag$, ...) were previously
received, and in this case it proceeds with the same
checks done for (verify, ...). It sends to TC either
(guilty) or (not guilty).
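
The payment and verify checks of the functionality above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the paper's formalism: all names are hypothetical, messages between parties are replaced by return values, `None` stands in for $\bot$, the OBU's verifyresp is passed in directly, and a repeated payment for the same tag is rejected without overwriting the stored tuple (the text does not specify this detail).

```python
from typing import Callable, Dict, Hashable, List, Optional, Set, Tuple

Location = Tuple[str, int]           # (loc, time)
Segment = Tuple[int, Location, int]  # (k, (loc_k, time_k), p_k)


class FOp:
    """Toy model of the Optimistic Payment functionality (payment and verify)."""

    def __init__(self) -> None:
        self.f: Optional[Callable[[str, int], int]] = None
        self.mu: Optional[Callable[[Hashable, Location], bool]] = None
        self.prices: Set[int] = set()   # the set of admissible prices
        self.payments: Dict[str, Optional[Tuple[int, List[Segment]]]] = {}
        self.proofs: Dict[str, Hashable] = {}

    def initialize(self, f, mu, prices) -> None:
        self.f, self.mu, self.prices = f, mu, set(prices)

    def payment(self, tag: str, fee: int, segments: List[Segment]):
        if tag in self.payments:             # a payment for tag was seen before
            return ("payment", tag, None)
        prices_ok = all(p in self.prices for _, _, p in segments)
        sum_ok = fee == sum(p for _, _, p in segments)
        if not (prices_ok and sum_ok):
            self.payments[tag] = None        # store (tag, bottom)
            return ("payment", tag, None)
        self.payments[tag] = (fee, list(segments))
        return ("payment", tag, fee, len(segments))

    def proof(self, tag: str, phi: Hashable):
        self.proofs[tag] = phi
        return ("proof", tag, phi)

    def verify(self, tag: str, response: Segment):
        """Check the OBU's response (sigma, (loc', time'), p') for tag."""
        if self.payments.get(tag) is None or tag not in self.proofs:
            raise LookupError("no valid payment or proof stored for this tag")
        phi = self.proofs[tag]
        _, segments = self.payments[tag]
        sigma, where, price = response
        stored = next((s for s in segments if s[0] == sigma), None)
        ok = (stored == response            # matches the committed tuple
              and self.mu(phi, where)       # proof phi places the car there
              and price == self.f(*where))  # price agrees with the policy f
        return ("not guilty" if ok else "guilty", response)
```

In this sketch a response is judged guilty as soon as any one of the three checks fails, which mirrors the "Otherwise" branch of the verify step.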


B Construction of an Optimistic Payment Scheme


We use several existing results to prove statements about 
discrete logarithms: (1) proof of knowledge of a discrete 
logarithm modulo a prime [41]; (2) proof of knowledge 
of the equality of some element in different representa- 
tions [12]; (3) proof with interval checks [37] and (4) 
proof of the disjunction or conjunction of any two of the 
previous [14]. These results are often given in the form 
of Σ-protocols but they can be turned into non-interactive
zero-knowledge arguments in the random oracle model 
via the Fiat-Shamir heuristic [24]. 

When referring to the proofs above, we follow the
notation introduced by Camenisch and Stadler [10] for
various proofs of knowledge of discrete logarithms and
proofs of the validity of statements about discrete
logarithms.
$\mathrm{NIPK}\{(\alpha, \beta, \delta) : y = g_0^{\alpha} g_1^{\beta} \wedge \tilde{y} = \tilde{g}_0^{\alpha} \tilde{g}_1^{\delta} \wedge A \le \alpha \le B\}$
denotes a “zero-knowledge Proof of
Knowledge of integers $\alpha$, $\beta$, and $\delta$ such that
$y = g_0^{\alpha} g_1^{\beta}$, $\tilde{y} = \tilde{g}_0^{\alpha} \tilde{g}_1^{\delta}$ and $A \le \alpha \le B$ holds”, where
$y, g_0, g_1, \tilde{y}, \tilde{g}_0$, and $\tilde{g}_1$ are elements of some groups
$G = \langle g_0 \rangle = \langle g_1 \rangle$ and $\tilde{G} = \langle \tilde{g}_0 \rangle = \langle \tilde{g}_1 \rangle$ that have the
same order. (Note that some elements in the
representation of $y$ and $\tilde{y}$ are equal.) The convention is that
letters in the parenthesis, in this example $\alpha$, $\beta$, and
$\delta$, denote quantities whose knowledge is being proven,
while all other values are known to the verifier. We
denote a non-interactive proof of signature possession as
$\mathrm{NIPK}\{(x, s_x) : \mathrm{SigVerify}(pk, x, s_x) = accept\}$.
B.1 Construction


We begin with a high level description of the optimistic
payment scheme. We assume that each party registers its
public key at $\mathcal{F}_{REG}$, and retrieves public keys from other
parties by querying $\mathcal{F}_{REG}$. They also retrieve the
common reference string $params_{Com}$, which is computed by
algorithm SetupOP.

Optimistic Payment

When TSP is activated with (initialize, $f$, $\mu$), TSP
runs TSPkg($1^k$) to obtain $(sk_{TSP}, pk_{TSP})$, and
obtains a setup $params$ with TSPinit($f$, $sk_{TSP}$). TSP
stores $TSP_0 = (f, \mu, sk_{TSP}, pk_{TSP}, params_{Com},
params)$ and sends $(f, \mu, params)$ to OBU. OBU
runs OBUkg($1^k$) to get $(sk_{OBU}, pk_{OBU})$ and
executes OBUinit($params$, $pk_{TSP}$) to get a bit $b$. If
$b = 0$, OBU rejects $params$. Otherwise OBU
stores the tuple $OBU_0 = (f, \mu, sk_{OBU}, pk_{OBU},
pk_{TSP}, params_{Com}, params)$.

When OBU is activated with (payment, $tag$, $fee$,
$(k, (loc_k, time_k), p_k)_{k=1}^{N}$) and OBU has previously
received $(f, \mu, params)$, OBU runs algorithm
Pay($params_{Com}$, $params$, $pk_{OBU}$, $sk_{OBU}$, $pk_{TSP}$,
$tag$, $fee$, $(k, (loc_k, time_k), p_k)_{k=1}^{N}$) to obtain a
payment message $m$ along with a signature $s_m$,
and auxiliary information $aux$. OBU sets $aux =
(aux, (k, (loc_k, time_k), p_k)_{k=1}^{N})$, stores $OBU_{tag} =
(OBU_0, m, s_m, aux)$ and sends $(m, s_m)$ to TSP.
TSP runs VerifyPayment($params_{Com}$, $pk_{OBU}$,
$pk_{TSP}$, $m$, $s_m$) to obtain a bit $b$. If $b = 0$, TSP
rejects $(m, s_m)$. Otherwise TSP stores $TSP_{tag} =
(TSP_0, m, s_m, pk_{OBU})$.

When TC is activated with (proof, $tag$, $\phi$), TC runs
TCkg($1^k$) to get $(pk_{TC}, sk_{TC})$, runs Prove($sk_{TC}$,
$tag$, $\phi$) to obtain a proof $Q$ and sends $(Q)$ to TSP.
TSP runs VerifyProof($pk_{TC}$, $Q$) and aborts if $b = 0$.
Otherwise TSP stores $TSP_{tag} = (TSP_{tag}, Q)$.

When TSP is activated with (verify, $tag$, $\phi$), and
TSP has previously obtained $(m, s_m)$ and $(Q)$,
TSP sends $(Q)$ to OBU. OBU executes
VerifyProof($pk_{TC}$, $Q$) and aborts if $b = 0$.
Otherwise OBU runs OBUopen($sk_{OBU}$, $Q$, $aux$)
to get a response $R$ and sends $(R)$ to TSP.
TSP runs Check($params_{Com}$, $pk_{OBU}$, $pk_{TSP}$, $m$,
$s_m$, $Q$, $R$) to obtain either (not guilty, $(k, (loc_k,
time_k), p_k)$) or (guilty, $(k, (loc_k, time_k), p_k)$).

When TSP is activated with (blame, $tag$), and
messages $(m, s_m)$, $(Q)$ and $(R)$ were previously
received, TSP sends $((m, s_m), R)$ to TC.
TC runs Check($params_{Com}$, $pk_{OBU}$, $pk_{TSP}$, $m$,
$s_m$, $Q$, $R$) to obtain (not guilty, $(k, (loc_k, time_k),
p_k)$) or (guilty, $(k, (loc_k, time_k), p_k)$).

In the following, we denote the signature algorithms
used by TSP, OBU and TC as (TSPkeygen, TSPsign,
TSPverify), (OBUkeygen, OBUsign, OBUverify) and
(TCkeygen, TCsign, TCverify). $H$ stands for a
collision-resistant hash function, which is modeled as a
random oracle.

SetupOP($1^k$). Run ComSetup($1^k$) and output
$params_{Com}$.

TSPkg($1^k$). Run TSPkeygen($1^k$) to get a key pair
$(pk_{TSP}, sk_{TSP})$. Output $(pk_{TSP}, sk_{TSP})$.

OBUkg($1^k$). Run OBUkeygen($1^k$) to get a key pair
$(pk_{OBU}, sk_{OBU})$. Output $(pk_{OBU}, sk_{OBU})$.

TCkg($1^k$). Run TCkeygen($1^k$) to obtain a key pair
$(pk_{TC}, sk_{TC})$. Output $(pk_{TC}, sk_{TC})$.

TSPinit($f$, $sk_{TSP}$). For all possible prices $p \in \Upsilon$,
run $s$ = TSPsign($sk_{TSP}$, $p$) and output the set
$params = (p, s)$.

OBUinit($params$, $pk_{TSP}$). Parse $params$ as $(p, s)$ and
run TSPverify($pk_{TSP}$, $p$, $s$) for all $p \in \Upsilon$. If all the
signatures are correct, output $b = 1$ else $b = 0$.

Pay($params_{Com}$, $params$, $pk_{OBU}$, $sk_{OBU}$, $pk_{TSP}$, $tag$,
$fee$, $(k, (loc_k, time_k), p_k)_{k=1}^{N}$). For $k = 1$ to
$N$, execute $h_k = H(loc_k, time_k)$, calculate a
commitment to the price $(c_k, open_k)$ =
Commit($params_{Com}$, $p_k$) and compute a proof of
possession of a signature on the price
$\pi_k = \mathrm{NIPK}\{(p_k, open_k, s_k) : \mathrm{TSPverify}(pk_{TSP}, p_k, s_k) = accept \wedge (c_k, open_k) = \mathrm{Commit}(params_{Com}, p_k)\}$.
Add all the prices to obtain the total fee
$fee$ and all the openings $open_k$ to get an opening
$open_{fee}$ to the commitment to the fee. Set payment
message $m = (tag, fee, open_{fee}, (h_k, c_k, \pi_k)_{k=1}^{N})$
and run $s_m$ = OBUsign($sk_{OBU}$, $m$). Output $(m, s_m)$
and $aux = (open_k)_{k=1}^{N}$.

VerifyPayment($params_{Com}$, $pk_{OBU}$, $pk_{TSP}$, $m$, $s_m$).
Parse $m$ as $(tag, fee, open_{fee}, (h_k, c_k, \pi_k)_{k=1}^{N})$. For
$k = 1$ to $N$, verify $\pi_k$. Add all the commitments
to obtain a commitment to the total fee $c_{fee}$, and
run Open($params_{Com}$, $c_{fee}$, $fee$, $open_{fee}$). If the
opening is correct, output $b = 1$. Otherwise output
$b = 0$.

Prove($sk_{TC}$, $tag$, $\phi$). Set $q = (tag, \phi)$ and run $s_q$ =
TCsign($sk_{TC}$, $q$). Output $Q = (q, s_q)$.

VerifyProof($pk_{TC}$, $Q$). Parse $Q$ as $(q, s_q)$ and run
TCverify($pk_{TC}$, $q$, $s_q$). Output $b = 1$ if the
signature is correct and $b = 0$ otherwise.

OBUopen($sk_{OBU}$, $Q$, $aux$). Parse proof $Q$ as $(q, s_q)$, $q$
as $(tag, \phi)$ and $aux$ as $(open_k, (k, (loc_k, time_k),
p_k))_{k=1}^{N}$. Find the data structure $(loc_k, time_k)$ such
that $\mu(\phi, (loc_k, time_k))$ outputs accept. Set $r =
(tag, (k, (loc_k, time_k), p_k), open_k)$ and run $s_r$ =
OBUsign($sk_{OBU}$, $r$). Output $R = (r, s_r)$.

Check($params_{Com}$, $pk_{OBU}$, $pk_{TSP}$, $m$, $s_m$, $Q$, $R$).
Parse $R$ as $(r, s_r)$ and run OBUverify($pk_{OBU}$,
$r$, $s_r$). If the signature is correct, parse $r$ as
$(tag, (\sigma, (loc_\sigma, time_\sigma), p_\sigma), open_\sigma)$, $Q$ as $((tag,
\phi), s_q)$ and $m$ as $(tag, fee, open_{fee}, (h_k, c_k,
\pi_k)_{k=1}^{N})$. Check that $open_\sigma$ was picked from
the adequate interval. Compute $h'_\sigma = H(loc_\sigma,
time_\sigma)$, check if $h'_\sigma = h_\sigma$ and if $\mu(\phi, (loc_\sigma,
time_\sigma))$ outputs accept. If it is the case, set
$reason_{pos} = 0$ and otherwise $reason_{pos} = 1$.
Compute $p'_\sigma = f(loc_\sigma, time_\sigma)$ and check if
$p'_\sigma = p_\sigma$. Run Open($params_{Com}$, $c_\sigma$, $p_\sigma$, $open_\sigma$).
If it opens correctly set $reason_{price} = 0$ and
otherwise $reason_{price} = 1$. If $reason_{pos} =
reason_{price} = 0$, output (not guilty, $(\sigma, (loc_\sigma,
time_\sigma), p_\sigma)$). If not, output (guilty, $(\sigma, (loc_\sigma,
time_\sigma), p_\sigma)$).


Theorem 1 This OP scheme securely realizes $\mathcal{F}_{OP}$.


We prove Theorem 1 in the extended version of this
work [3]. 


B.2 Efficient Instantiation 


We propose an efficient instantiation for the commitment 
scheme, TSP’s signature scheme and the non-interactive
proof of signature possession that are used in the
construction described in the previous section. The
signature schemes of TC and OBU can be instantiated with
any existentially unforgeable signature scheme.


Signature Scheme. We select the signature scheme
proposed by Camenisch and Lysyanskaya [9].

- SigKeygen. On input $1^k$, generate two safe primes $p$, $q$
of length $k$ such that $p = 2p' + 1$ and $q = 2q' + 1$.
The special RSA modulus of length $l_n$ is defined as
$n = pq$. Output secret key $sk = (p, q)$. Choose
uniformly at random $S \in_R QR_n$, and $R, Z \in_R \langle S \rangle$.
Output public key $pk = (n, R, S, Z)$.

- SigSign. On input message $x$ of length $l_x$, choose a
random prime number $e$ of length $l_e > l_x + 3$, and
a random number $v$ of length $l_v = l_n + l_e + l_r$,
where $l_r$ is a security parameter [9]. Compute the
value $A$ such that $Z = A^e R^x S^v \pmod{n}$. Output
the signature $(A, e, v)$.

- SigVerify. On inputs message $x$ and signature $(A, e,
v)$, check that $Z = A^e R^x S^v \pmod{n}$ and
$2^{l_e - 1} < e < 2^{l_e}$.
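
The signing equation $Z = A^e R^x S^v \pmod{n}$ can be exercised with a toy Python sketch. Everything about the parameters is an assumption made for illustration: the safe primes 23 and 47 are far too small to be secure, $e$ is drawn from a short hard-coded list of primes, and the length checks on $e$ and $v$ are omitted. The function names are hypothetical. Negative exponents in `pow` require Python 3.8 or later.

```python
import secrets
from math import gcd

# Toy parameters only; a real modulus uses safe primes of ~1024 bits each.
P_PRIME, Q_PRIME = 11, 23                 # p', q'
P, Q = 2 * P_PRIME + 1, 2 * Q_PRIME + 1   # safe primes 23 and 47
N = P * Q                                 # special RSA modulus n
ORDER = P_PRIME * Q_PRIME                 # order of QR_n, known to the signer


def sig_keygen():
    """Pick S generating QR_n and R, Z in <S>; return pk = (n, R, S, Z)."""
    while True:
        s = pow(secrets.randbelow(N - 2) + 2, 2, N)   # a quadratic residue
        if gcd(s, N) == 1 and pow(s, P_PRIME, N) != 1 and pow(s, Q_PRIME, N) != 1:
            break                                     # s has full order p'q'
    r = pow(s, secrets.randbelow(ORDER - 1) + 1, N)
    z = pow(s, secrets.randbelow(ORDER - 1) + 1, N)
    return (N, r, s, z)


def sig_sign(pk, x):
    """Compute A with Z = A^e R^x S^v (mod n), using the secret group order."""
    n, r, s, z = pk
    e = secrets.choice([13, 17, 19, 29, 31])   # toy primes coprime to p'q'
    v = secrets.randbelow(ORDER)
    target = z * pow(pow(r, x, n) * pow(s, v, n), -1, n) % n
    d = pow(e, -1, ORDER)                      # e-th root exponent in QR_n
    return (pow(target, d, n), e, v)


def sig_verify(pk, x, sig):
    """Check Z = A^e R^x S^v (mod n); length checks are omitted in this toy."""
    n, r, s, z = pk
    a, e, v = sig
    return z == pow(a, e, n) * pow(r, x, n) * pow(s, v, n) % n
```

The signer can extract the $e$-th root only because it knows $p'q'$; a verifier who sees just $n$ cannot, which is what makes forging $A$ hard at realistic sizes.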


Commitment Scheme. We select the integer
commitment scheme due to Damgård and Fujisaki [15].

- ComSetup. Given a special RSA modulus, pick a
random generator $g_1 \in_R QR_n$. Pick random
$\alpha \leftarrow \{0,1\}^{l_n + l_z}$ and compute $g_0 = g_1^{\alpha}$. Output
parameters $(g_0, g_1, n)$.

- Commit. On input message $x$ of length $l_x$, choose a
random number $open_x \in \{0,1\}^{l_n + l_z}$, and compute
$c_x = g_0^{x} g_1^{open_x} \pmod{n}$. Output the commitment
$c_x$ and the opening $open_x$.

- Open. On inputs message $x$ and opening $open_x$,
compute $c'_x = g_0^{x} g_1^{open_x} \pmod{n}$ and check whether
$c_x = c'_x$.
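
The three algorithms above, together with the fee check that VerifyPayment performs on the product of commitments, can be sketched in Python. As before this is a toy under stated assumptions: the modulus $23 \cdot 47$ is insecure, the opening length of 16 bits is arbitrary, and the function names are hypothetical.

```python
import secrets
from math import gcd

N = 23 * 47                  # toy special RSA modulus (23 and 47 are safe primes)
P_PRIME, Q_PRIME = 11, 23    # p', q'; QR_n has order p' * q'


def com_setup():
    """Pick a generator g1 of QR_n and set g0 = g1^alpha; return (g0, g1, n)."""
    while True:
        g1 = pow(secrets.randbelow(N - 2) + 2, 2, N)
        if gcd(g1, N) == 1 and pow(g1, P_PRIME, N) != 1 and pow(g1, Q_PRIME, N) != 1:
            break
    alpha = secrets.randbelow(P_PRIME * Q_PRIME - 1) + 1
    return (pow(g1, alpha, N), g1, N)


def commit(params, x, open_bits=16):
    """Return (c_x, open_x) with c_x = g0^x * g1^open_x mod n."""
    g0, g1, n = params
    opening = secrets.randbits(open_bits)
    return pow(g0, x, n) * pow(g1, opening, n) % n, opening


def open_commitment(params, c, x, opening):
    """Recompute the commitment from (x, open_x) and compare it with c."""
    g0, g1, n = params
    return c == pow(g0, x, n) * pow(g1, opening, n) % n


def fee_matches(params, commitments, fee, open_fee):
    """Multiply the subfee commitments and open the product to the total fee."""
    c_fee = 1
    for c in commitments:
        c_fee = c_fee * c % params[2]
    return open_commitment(params, c_fee, fee, open_fee)
```

With subfees 5, 3 and 4, committing to each and calling `fee_matches(params, cs, 12, sum(openings))` succeeds, while any other claimed fee fails with the same openings.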

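Non-Interactive Zero-Knowledge Argument. We
employ the proof of possession of a signature in [9]. Given
a signature $(A, e, v)$ on message $x$ and a commitment
to the message $c_x = g_0^{x} g_1^{open_x}$, the prover computes
$\tilde{A} = A g_0^{w}$, a commitment $c_w = g_0^{w} g_1^{open_w}$ and a proof
that:

$$\mathrm{NIPK}\{(x, open_x, e, v, w, open_w, w \cdot e, open_w \cdot e) :
c_x = g_0^{x} g_1^{open_x} \wedge Z = \tilde{A}^{e} R^{x} S^{v} (1/g_0)^{w \cdot e} \wedge
c_w = g_0^{w} g_1^{open_w} \wedge 1 = c_w^{e} (1/g_0)^{w \cdot e} (1/g_1)^{open_w \cdot e} \wedge
e \in \{0,1\}^{l_e + l_c + l_z} \wedge x \in \{0,1\}^{l_x + l_c + l_z}\}$$

We turn it into a non-interactive zero-knowledge
argument via the Fiat-Shamir heuristic. The prover picks
random values:

$$r_x \leftarrow \{0,1\}^{l_x + l_c + l_z}, \quad r_{open_x} \leftarrow \{0,1\}^{\ldots}$$
$$r_w \leftarrow \{0,1\}^{\ldots}, \quad r_{open_w} \leftarrow \{0,1\}^{\ldots}$$
$$r_e \leftarrow \{0,1\}^{l_e + l_c + l_z}, \quad r_{w \cdot e} \leftarrow \{0,1\}^{l_n + l_e + l_c + l_z}$$
$$r_v \leftarrow \{0,1\}^{l_v + l_c + l_z}, \quad r_{open_w \cdot e} \leftarrow \{0,1\}^{l_n + l_e + l_c + l_z}$$

and computes commitments:

$$t_{c_x} = g_0^{r_x} g_1^{r_{open_x}}, \quad t_{c_w} = g_0^{r_w} g_1^{r_{open_w}}$$
$$t_Z = \tilde{A}^{r_e} R^{r_x} S^{r_v} (1/g_0)^{r_{w \cdot e}}$$
$$t_1 = c_w^{r_e} (1/g_0)^{r_{w \cdot e}} (1/g_1)^{r_{open_w \cdot e}}$$

Let the challenge computed by the prover be:

$$ch = H(n \| g_0 \| g_1 \| \tilde{A} \| R \| S \| 1/g_0 \| 1/g_1 \| c_x \| Z \| c_w \| 1 \| t_{c_x} \| t_Z \| t_{c_w} \| t_1).$$

The prover computes responses:

$$s_x = r_x - ch \cdot x, \quad s_{open_x} = r_{open_x} - ch \cdot open_x$$
$$s_w = r_w - ch \cdot w, \quad s_{open_w} = r_{open_w} - ch \cdot open_w$$
$$s_e = r_e - ch \cdot e, \quad s_{w \cdot e} = r_{w \cdot e} - ch \cdot (w \cdot e)$$
$$s_v = r_v - ch \cdot v, \quad s_{open_w \cdot e} = r_{open_w \cdot e} - ch \cdot (open_w \cdot e)$$

and sends to the verifier:

$$\pi = (\tilde{A}, c_w, ch, s_x, s_{open_x}, s_e, s_v, s_w, s_{open_w}, s_{w \cdot e}, s_{open_w \cdot e}).$$

The verifier computes:

$$t'_{c_x} = c_x^{ch} g_0^{s_x} g_1^{s_{open_x}}, \quad t'_{c_w} = c_w^{ch} g_0^{s_w} g_1^{s_{open_w}}$$
$$t'_Z = Z^{ch} \tilde{A}^{s_e} R^{s_x} S^{s_v} (1/g_0)^{s_{w \cdot e}}$$
$$t'_1 = c_w^{s_e} (1/g_0)^{s_{w \cdot e}} (1/g_1)^{s_{open_w \cdot e}}$$

and checks whether:

$$s_e \in \{0,1\}^{l_e + l_c + l_z}, \quad s_x \in \{0,1\}^{l_x + l_c + l_z}$$

and finally:

$$ch = H(n \| g_0 \| g_1 \| \tilde{A} \| R \| S \| 1/g_0 \| 1/g_1 \| c_x \| Z \| c_w \| 1 \| t'_{c_x} \| t'_Z \| t'_{c_w} \| t'_1).$$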
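
The commit, challenge, respond pattern used above can be seen in isolation on the first conjunct only, namely knowledge of $(x, open_x)$ with $c_x = g_0^{x} g_1^{open_x}$. The Python sketch below is a deliberately reduced illustration, not the full proof of signature possession: the modulus and generators are toy values, SHA-256 truncated to 128 bits stands in for $H$, the bit lengths are arbitrary assumptions, and the range checks on the responses are omitted. Responses are computed over the integers, so they may be negative; `pow` with a negative exponent needs Python 3.8 or later.

```python
import hashlib
import secrets

N = 23 * 47          # toy special RSA modulus
G1 = 4               # generates QR_n for this toy modulus (order 11 * 23)
G0 = pow(G1, 5, N)   # g0 = g1^alpha with a toy alpha = 5
L_X, L_OPEN, L_C, L_Z = 8, 16, 128, 80   # assumed bit lengths


def challenge(*values):
    """Fiat-Shamir challenge: hash of the public values, truncated to L_C bits."""
    data = "||".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[: L_C // 8], "big")


def prove(x, opening, c):
    """Prove knowledge of (x, opening) such that c = g0^x * g1^opening mod n."""
    r_x = secrets.randbits(L_X + L_C + L_Z)
    r_open = secrets.randbits(L_OPEN + L_C + L_Z)
    t = pow(G0, r_x, N) * pow(G1, r_open, N) % N      # prover's commitment
    ch = challenge(N, G0, G1, c, t)
    return ch, r_x - ch * x, r_open - ch * opening    # responses over the integers


def verify(c, proof):
    """Recompute t' = c^ch * g0^s_x * g1^s_open and compare the challenges."""
    ch, s_x, s_open = proof
    t = pow(c, ch, N) * pow(G0, s_x, N) * pow(G1, s_open, N) % N
    return ch == challenge(N, G0, G1, c, t)
```

Verification works because $c^{ch} g_0^{r_x - ch \cdot x} g_1^{r_{open} - ch \cdot open} = g_0^{r_x} g_1^{r_{open}}$, so an honest prover's recomputed commitment hashes to the same challenge.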
An Analysis of Private Browsing Modes in Modern Browsers


Gaurav Aggarwal Elie Bursztein
Stanford University 


Abstract 

We study the security and privacy of private browsing 
modes recently added to all major browsers. We first pro- 
pose a clean definition of the goals of private browsing 
and survey its implementation in different browsers. We 
conduct a measurement study to determine how often it is 
used and on what categories of sites. Our results suggest 
that private browsing is used differently from how it is 
marketed. We then describe an automated technique for 
testing the security of private browsing modes and report 
on a few weaknesses found in the Firefox browser. Fi- 
nally, we show that many popular browser extensions and 
plugins undermine the security of private browsing. We 
propose and experiment with a workable policy that lets 
users safely run extensions in private browsing mode. 


1 Introduction 


The four major browsers (Internet Explorer, Firefox, 
Chrome and Safari) recently added private browsing 
modes to their user interfaces. Loosely speaking, these 
modes have two goals. First and foremost, sites visited 
while browsing in private mode should leave no trace on 
the user’s computer. A family member who examines the 
browser’s history should find no evidence of sites visited 
in private mode. More precisely, a local attacker who 
takes control of the machine at time 7’ should learn no 
information about private browsing actions prior to time 
T’. Second, users may want to hide their identity from 
web sites they visit by, for example, making it difficult 
for web sites to link the user’s activities in private mode 
to the user’s activities in public mode. We refer to this as 
privacy from a web attacker. 

While all major browsers support private browsing, 
there is a great deal of inconsistency in the type of pri- 
vacy provided by the different browsers. Firefox and 
Chrome, for example, attempt to protect against a local 
attacker and take some steps to protect against a web at- 
tacker, while Safari only protects against a local attacker. 
Even within a single browser there are inconsistencies.
For example, in Firefox 3.6, cookies set in public mode 
are not available to the web site while the browser is in 
private mode. However, passwords and SSL client cer- 
tificates stored in public mode are available while in pri- 
vate mode. Since web sites can use the password man- 
ager as a crude cookie mechanism, the password policy 
is inconsistent with the cookie policy. 

Browser plug-ins and extensions add considerable 
complexity to private browsing. Even if a browser ad- 
equately implements private browsing, an extension can 
completely undermine its privacy guarantees. In Sec- 
tion 6.1 we show that many widely used extensions un- 
dermine the goals of private browsing. For this reason, 
Google Chrome disables all extensions while in private 
mode, negatively impacting the user experience. Firefox, 
in contrast, allows extensions to run in private mode, fa- 
voring usability over security. 


Our contribution. The inconsistencies between the 
goals and implementation of private browsing suggest
that there is considerable room for research on private 
browsing. We present the following contributions. 


• Threat model. We begin with a clear definition of
the goals of private browsing. Our model has two
somewhat orthogonal goals: security against a local
attacker (the primary goal of private browsing) and
security against a web attacker. We show that
correctly implementing private browsing can be
non-trivial and in fact all browsers fail in one way or
another. We then survey how private browsing is
implemented in the four major browsers, highlighting
the quirks and differences between the browsers.


• Experiment. We conduct an experiment to test
how private browsing is used. Our study is based
on a technique we discovered to remotely test if a
browser is currently in private browsing mode.
Using this technique we post ads on ad-networks and
determine how often private mode is used. Using ad
targeting by the ad-network we target different cat- 
egories of sites, enabling us to correlate the use of 
private browsing with the type of site being visited. 
We find it to be more popular at adult sites and less 
popular at gift sites, suggesting that its primary pur- 
pose may not be shopping for “surprise gifts.” We 
quantify our findings in Section 4. 


• Tools. We describe an automated technique for
identifying failures in private browsing implemen- 
tations and use it to discover a few weaknesses in 
the Firefox browser. 


• Browser extensions. We propose an improvement
to existing approaches to extensions in private
browsing mode, preventing extensions from unin- 
tentionally leaving traces of the private activity on 
disk. We implement our proposal as a Firefox ex- 
tension that imposes this policy on other extensions. 


Organization. Section 2 presents a threat model for pri- 
vate browsing. Section 3 surveys private browsing mode 
in modern browsers. Section 4 describes our experimen- 
tal measurement of private browsing usage. Section 5 
describes the weaknesses we found in existing private 
browsing implementations. Section 6 addresses the chal- 
lenges introduced by extensions and plug-ins. Section 7 
describes additional related work. Section 8 concludes. 


2 Private browsing: goal and threat model 


In defining the goals and threat model for private brows- 
ing, we consider two types of attackers: an attacker who 
controls the user’s machine (a local attacker) and an at- 
tacker who controls web sites that the user visits (a web 
attacker). We define security against each attacker in 
turn. In what follows we refer to the user browsing the 
web in private browsing mode as the user and refer to 
someone trying to determine information about the user’s 
private browsing actions as the attacker. 


2.1 Local attacker 


Stated informally, security against a local attacker means 
that an attacker who takes control of the machine after 
the user exits private browsing can learn nothing about 
the user’s actions while in private browsing. We define 
this more precisely below. 

We emphasize that the local attacker has no access to 
the user’s machine before the user exits private brows- 
ing. Without this limitation, security against a local at- 
tacker is impossible; an attacker who has access to the 
user’s machine before or during a private browsing ses- 
sion can simply install a key-logger and record all user 
actions. By restricting the local attacker to “after the
fact” forensics, we can hope to provide security by hav- 
ing the browser adequately erase persistent state changes 
during a private browsing session. 

As we will see, this requirement is far from simple. 
For one thing, not all state changes during private brows- 
ing should be erased at the end of a private browsing ses- 
sion. We draw a distinction between four types of persis- 
tent state changes: 


1. Changes initiated by a web site without any user in- 
teraction. A few examples in this category include 
setting a cookie, adding an entry to the history file, 
and adding data to the browser cache. 

2. Changes initiated by a web site, but requiring user 
interaction. Examples include generating a client 
certificate or adding a password to the password 
database. 

3. Changes initiated by the user. For example, creating 
a bookmark or downloading a file. 

4. Non-user-specific state changes, such as installing a 
browser patch or updating the phishing block list. 


All browsers try to delete state changes in category (1) 
once a private browsing session is terminated. Failure to 
do so is treated as a private browsing violation. However,
changes in the other three categories are in a gray area 
and different browsers treat these changes differently and 
often inconsistently. We discuss implementations in dif- 
ferent browsers in the next section. 

To keep our discussion general we use the term pro- 
tected actions to refer to state changes that should be 
erased when leaving private browsing. It is up to each 
browser vendor to define the set of protected actions. 


Network access. Another complication in defining pri- 
vate browsing is server side violations of privacy. Con- 
sider a web site that inadvertently displays to the world 
the last login time of every user registered at the site. 
Even if the user connects to the site while in private 
mode, the user’s actions are open for anyone to see. In 
other words, web sites can easily violate the goals of pri- 
vate browsing, but this should not be considered a viola- 
tion of private browsing in the browser. Since we are 
focusing on browser-side security, our security model 
defined below ignores server side violations. While 
browser vendors mostly ignore server side violations, 
one can envision a number of potential solutions: 


• Much like the phishing filter, browsers can consult a
block list of sites that should not be accessed while 
in private browsing mode. 

• Alternatively, sites can provide a P3P-like policy
statement saying that they will not violate private 
browsing. While in private mode, the browser will 
not connect to sites that do not display this policy. 
• A non-technical solution is to post a privacy seal at
web sites that comply with private browsing. Users
can avoid non-compliant sites when browsing pri- 
vately. 


Security model. Security is usually defined using two 
parameters: the attacker’s capabilities and the attacker’s 
goals. A local private browsing attacker has the follow- 
ing capabilities: 


• The attacker does nothing until the user leaves
private browsing mode, at which point the attacker gets
complete control of the machine. This captures 
the fact that the attacker is limited to after-the-fact 
forensics. 


In this paper we focus on persistent state violations, 
such as those stored on disk; we ignore private state 
left in memory. Thus, we assume that before the 
attacker takes over the local machine all volatile 
memory is cleared (though data on disk, including 
the hibernation file, is fair game). Our reason for ig- 
noring volatile memory is that erasing all of it when 
exiting private browsing can be quite difficult and, 
indeed, no browser does it. We leave it as future 
work to prevent privacy violations resulting from 
volatile memory. 


• While active, the attacker cannot communicate with
network elements that contain information about the 
user’s activities while in private mode (e.g. web 
sites the user visited, caching proxies, etc.). This 
captures the fact that we are studying the implemen- 
tation of browser-side privacy modes, not server- 
side privacy. 


Given these capabilities, the attacker’s goal is as fol- 
lows: for a set S of HTTP requests of the attacker’s 
choosing, determine if the browser issued any of those 
requests while in private browsing mode. More precisely, 
the attacker is asked to distinguish a private browsing 
session where the browser makes one of the requests in 
S from a private browsing session where the browser 
does not. If the local attacker cannot achieve this goal 
then we say that the browser’s implementation of private 
browsing is secure. This will be our working definition 
throughout the paper. Note that since an HTTP request 
contains the name of the domain visited this definition 
implies that the attacker cannot tell if the user visited a 
particular site (to see why, set S to be the set of all
possible HTTP requests to the site in question). Moreover,
even if by some auxiliary information the attacker knows 
that the user visited a particular site, the definition im- 
plies that the attacker cannot tell what the user did at the 
site. 
An alternate definition, which is much harder to
achieve, requires that the browser hide whether private 
mode was used at all. We will not consider this stronger 
goal in the paper. Similarly, we do not formalize proper- 
ties of private browsing in case the user never exits pri- 
vate browsing mode. 


Difficulties. Browser vendors face a number of chal- 
lenges in securing private browsing against a local at- 
tacker. One set of problems is due to the underlying op- 
erating system. We give two examples: 


First, when connecting to a remote site the browser 
must resolve the site’s DNS name. Operating systems 
often cache DNS resolutions in a local DNS cache. A 
local attacker can examine the DNS cache and the TTL 
values to learn if and when the user visited a particular 
site. Thus, to properly implement private browsing, the 
browser will need to ensure that all DNS queries while 
in private mode do not affect the system’s DNS cache: 
no entries should be added or removed. A more aggres- 
sive solution, supported in Windows 2000 and later, is to 
flush the entire DNS resolver cache when exiting private 
browsing. None of the mainstream browsers currently 
address this issue. 


Second, the operating system can swap memory pages 
to the swap partition on disk which can leave traces of the 
user’s activity. To test this out we performed the follow- 
ing experiment on Ubuntu 9.10 running Firefox 3.5.9: 


1. We rebooted the machine to clear RAM, then set up
and mounted a swap file (zeroed out).

2. Next, we started Firefox, switched to private brows- 
ing mode, browsed some websites and exited pri- 
vate mode but kept Firefox running. 

3. Once the browser was in public mode, we ran a 
memory leak program a few times to force memory 
pages to be swapped out. We then ran strings 
on the swap file and searched for specific words 
and content of the webpages visited while in private 
mode. 


The experiment showed that the swap file contained 
some URLs of visited websites, links embedded in those 
pages and sometimes even the text from a page — enough 
information to learn about the user’s activity in private 
browsing. 
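
The scan in step 3 can be sketched as follows. This is a minimal Node.js sketch under stated assumptions, not the script used in the experiment: `extractStrings` is a hypothetical helper that imitates the `strings` utility by collecting runs of printable ASCII, and the keyword and file path in the usage comment are made up.

```javascript
// Hypothetical helper: collect runs of printable ASCII bytes, similar in
// spirit to the Unix `strings` utility (minimum run length of 4 by default).
function extractStrings(buf, minLen = 4) {
  const found = [];
  let start = -1;
  for (let i = 0; i <= buf.length; i++) {
    const printable = i < buf.length && buf[i] >= 0x20 && buf[i] <= 0x7e;
    if (printable) {
      if (start < 0) start = i; // a new run begins here
    } else if (start >= 0) {
      if (i - start >= minLen) found.push(buf.toString("latin1", start, i));
      start = -1;
    }
  }
  return found;
}

// Return the extracted strings that mention any of the given keywords.
function findTraces(buf, keywords) {
  return extractStrings(buf).filter((s) =>
    keywords.some((k) => s.toLowerCase().includes(k.toLowerCase()))
  );
}

// Usage on a real swap file (the path is an assumption):
//   const fs = require("fs");
//   const hits = findTraces(fs.readFileSync("/swapfile"), ["example.com"]);
```

A real swap file may be too large to read into memory at once; a streaming variant would be needed in that case.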

This experiment shows that a full implementation of 
private browsing will need to prevent browser memory 
pages from being swapped out. None of the mainstream 
browsers currently do this. 


Non-solutions. At first glance it may seem that secu- 
rity against a local attacker can be achieved using virtual 
machine snapshots. The browser runs on top of a vir- 
tual machine monitor (VMM) that takes a snapshot of the 
browser state whenever the browser enters private browsing
mode. When the user exits private browsing the
VMM restores the browser, and possibly other OS data, 
to its state prior to entering private mode. This architec- 
ture is unacceptable to browser vendors for several rea- 
sons: first, a browser security update installed during pri- 
vate browsing will be undone when exiting private mode; 
second, documents manually downloaded and saved to 
the file system during private mode will be lost when ex- 
iting private mode, causing user frustration; and third, 
manual tweaks to browser settings (e.g. the homepage 
URL, visibility status of toolbars, and bookmarks) will 
revert to their earlier settings when exiting private mode. 
For all these reasons and others, a complete restore of the 
browser to its state when entering private mode is not the 
desired behavior. Only browser state that reveals infor- 
mation on sites visited should be deleted. 

User profiles provide a lightweight approach to imple- 
menting the VM snapshot method described above. User 
profiles store all browser state associated with a partic- 
ular user. Firefox, for example, supports multiple user 
profiles and the user can choose a profile when start- 
ing the browser. The browser can make a backup of the 
user’s profile when entering private mode and restore the 
profile to its earlier state when exiting private mode. This 
mechanism, however, suffers from all the problems men- 
tioned above. 

Rather than a snapshot-and-restore approach, all four 
major browsers take the approach of not recording cer- 
tain data while in private mode (e.g. the history file is 
not updated) and deleting other data when exiting pri- 
vate mode (e.g. cookies). As we will see, some data that 
should be deleted is not. 


2.2 Web attacker 


Beyond a local attacker, browsers attempt to provide 
some privacy from web sites. Here the attacker does not 
control the user’s machine, but has control over some vis- 
ited sites. There are three orthogonal goals that browsers 
try to achieve to some degree: 


• Goal 1: A web site cannot link a user visiting
in private mode to the same user visiting in pub- 
lic mode. Firefox, Chrome, and IE implement this 
(partially) by making cookies set in public mode un- 
available while in private mode, among other things 
discussed in the next section. Interestingly, Safari 
ignores the web attacker model and makes public 
cookies available in private browsing. 


• Goal 2: A web site cannot link a user in one private
session to the same user in another private session. 
More precisely, consider the following sequence of 
visits at a particular site: the user visits in public 
mode, then enters private mode and visits again, exits
private mode and visits again, re-activates private
mode and visits again. The site should not
be able to link the two private sessions to the same 
user. Browsers implement this (partially) by delet- 
ing cookies set while in private mode, as well as 
other restrictions discussed in the next section. 


• Goal 3: A web site should not be able to determine
whether the browser is currently in private browsing 
mode. While this is a desirable goal, all browsers 
fail to satisfy this; we describe a generic attack in 
Section 4. 


Goals (1) and (2) are quite difficult to achieve. At 
the very least, the browser’s IP address can help web 
sites link users across private browsing boundaries. Even 
if we ignore IP addresses, a web site can use various 
browser features to fingerprint a particular browser and 
track that browser across privacy boundaries. Mayer [14] 
describes a number of such features, such as screen reso- 
lution, installed plug-ins, timezone, and installed fonts, 
all available through standard JavaScript objects. The 
Electronic Frontier Foundation recently built a web site 
called Panopticlick [6] to demonstrate that most browsers 
can be uniquely fingerprinted. Their browser fingerprint- 
ing technology completely breaks private browsing goals 
(1) and (2) in all browsers. 
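
To make the fingerprinting idea concrete, here is a minimal sketch that folds a few of the attributes listed above into one identifier. It is an illustration under assumptions, not the method used by Panopticlick or described in [14]: the function names, the choice of attributes, and the use of a 32-bit FNV-1a style hash are all ours. Installed fonts are left out because reading them needs more than a plain property lookup.

```javascript
// Hypothetical sketch: join a few browser attributes into one string.
// None of these values change when private browsing is toggled.
function collectAttributes(env) {
  return [
    env.screen.width + "x" + env.screen.height + "x" + env.screen.colorDepth,
    Array.from(env.navigator.plugins, (p) => p.name).sort().join(","),
    String(env.timezoneOffset),
    env.navigator.userAgent,
  ].join("|");
}

// 32-bit FNV-1a style hash over UTF-16 code units, used only to obtain a
// short, stable identifier.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

function fingerprint(env) {
  return fnv1a(collectAttributes(env));
}

// In a page, `env` could be built from the real objects:
//   fingerprint({ screen, navigator,
//                 timezoneOffset: new Date().getTimezoneOffset() });
```

A site that stores this value on its server sees the same identifier in public and private mode, which is why clearing cookies alone does not meet goals (1) and (2).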

Torbutton [29] — a Tor client implemented as a Fire- 
fox extension — puts considerable effort into achieving 
goals (1) and (2). It hides the client’s IP address using the 
Tor network and takes steps to prevent browser finger- 
printing. This functionality is achieved at a considerable 
performance and convenience cost to the user. 


3 A survey of private browsing in modern 
browsers 


All four major browsers (Internet Explorer 8, Firefox
3.5, Safari 4, and Google Chrome 5) implement a private 
browsing mode. This feature is called InPrivate in In- 
ternet Explorer, Private Browsing in Firefox and Safari, 
and Incognito in Chrome. 


User interface. Figure 1 shows the user interface associated
with these modes in each of the browsers. Chrome
and Internet Explorer have obvious chrome indicators 
that the browser is currently in private browsing mode, 
while the Firefox indicator is more subtle and Safari only 
displays the mode in a pull down menu. The difference 
in visual indicators has to do with shoulder surfing: can 
a casual observer tell if the user is currently browsing 
privately? Safari takes this issue seriously and provides 
no visual indicator in the browser chrome, while other 
browsers do provide a persistent indicator. We expect 
that hiding the visual indicator causes users who turn on
private browsing to forget to turn it off. We give some ev- 
idence of this phenomenon in Section 4 where we show 
that the percentage of users who browse the web in pri- 
vate mode is greater in browsers with subtle visual indi- 
cators. 

Another fundamental difference between the browsers 
is how they start private browsing. IE and Chrome spawn 
a new window while keeping old windows open, thus 
allowing the user to simultaneously use the two modes. 
Firefox does not allow mixing the two modes. When en- 
tering private mode it hides all open windows and spawns 
a new private browsing window. Unhiding public win- 
dows does nothing since all tabs in these windows are 
frozen while browsing privately. Safari simply switches 
the current window to private mode and leaves all tabs 
unchanged. 


Internal behavior. To document how the four implementations
differ, we tested a variety of browser features
that maintain state and observed the browsers’ handling
of each feature in conjunction with private browsing
mode. The results of these tests, conducted on Windows 7
using default browser settings, are summarized in Tables 1, 2
and 3.

Table 1 studies the types of data set in public mode
that are available in private mode. Some browsers block
data set in public mode to make it harder for web sites to
link the private user to the public user (addressing the web
attacker model). The Safari column in Table 1 shows
that Safari ignores the web attacker model altogether and
makes all public data available in private mode except
for the web cache. Firefox, IE, and Chrome block access
to some public data while allowing access to other
data. All three make public history, bookmarks and passwords
available in private browsing, but block public
cookies and HTML5 local storage. Firefox allows SSL
client certs set in public mode to be used in private mode,
thus enabling a web site to link the private session to the
user’s public session. Hence, Firefox’s client cert policy
is inconsistent with its cookie policy. IE differs from
the other three browsers in the policy for form field
auto-completion; it allows using data from public mode.

Table 2 studies the type of data set in private mode
that persists after the user leaves private mode. A local
attacker can use data that persists to learn user actions
in private mode. All four browsers block cookies,
history, and HTML5 local storage from propagating
to public mode, but persist bookmarks and downloads.
Note that all browsers other than Firefox persist server
self-signed certificates approved by the user while in private
browsing mode. Lewis [35] recently pointed out that
Chrome 5.0.375.38 persisted the window zoom level for
URLs across incognito sessions; this issue has been fixed
as of Chrome 5.0.375.53.

Table 3 studies data that is entered in private mode and
persists during that same private mode session. While
in private mode, Firefox writes nothing to the history
database and similarly no new passwords and no search
terms are saved. However, cookies are stored in memory
while in private mode and erased when the user exits
private mode. These cookies are not written to persistent
storage to ensure that if the browser crashes in
private mode this data will be erased. The browser’s
web cache is handled similarly. We note that among the
four browsers, only Firefox stores the list of downloaded
items in private mode. This list is cleared on leaving private
mode.


3.1 A few initial privacy violation examples 


In Section 5.1 we describe tests of private browsing mode 
that revealed several browser attributes that persist after 
a private browsing session is terminated. Web sites that 
use any of these features leave tracks on the user’s ma- 
chine that will enable a local attacker to determine the 
user’s activities in private mode. We give a few exam- 
ples below. 


Custom Handler Protocol. Firefox implements an
HTML 5 feature called custom protocol handlers (CPH)
that enables a web site to define custom protocols,
namely URLs of the form xyz://site/path where
xyz is a custom protocol name. We discovered that custom
protocol handlers defined while the browser is in
private mode persist after private browsing ends. Consequently,
sites that use this feature will leak the fact that
the user visited these sites to a local attacker.
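
For illustration, a page might register such a handler roughly as follows. This is a hedged sketch: `registerHandler`, the scheme name, and the handler URL are made up, and the `nav` parameter stands in for the browser's `navigator` object so that the function can be exercised outside a browser. Current browsers only accept custom scheme names that start with `web+`, which is why the sketch uses `web+xyz` rather than the plain `xyz` of the example above.

```javascript
// Sketch of a page registering a custom protocol handler.
// `nav` is the browser's `navigator` object (or a stand-in for testing).
function registerHandler(nav) {
  if (!nav || typeof nav.registerProtocolHandler !== "function") {
    return false; // feature not available in this environment
  }
  nav.registerProtocolHandler(
    "web+xyz",                            // hypothetical scheme name
    "https://site.example/handle?uri=%s", // "%s" is replaced by the clicked URL
    "Example xyz handler"                 // title; newer browsers may ignore it
  );
  return true;
}

// In a page: registerHandler(navigator);
```

The browser asks the user for permission and then records the mapping from scheme to handler URL; that stored record is what survives the private session.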


Client Certificate. IE, Firefox, and Safari support SSL 
client certificates. A web site can, using JavaScript, in- 
struct the browser to generate an SSL client public/pri- 
vate key pair. We discovered that all these browsers re- 
tain the generated key pair even after private browsing 
ends. Again, if the user visits a site that generates an 
SSL client key pair, the resulting keys will leak the site’s 
identity to the local attacker. When Internet Explorer and 
Safari encounter a self-signed certificate they store it in 
a Microsoft certificate vault. We discovered that entries 
added to the vault while in private mode remain in the 
vault when the private session ends. Hence, if the user 
visits a site that is using a self-signed certificate, that information
will be available to the local attacker even after
the user leaves private mode. 


Figure 1: Private browsing indicators in major browsers

Table 3: Is the state set in private mode at some point accessible later in the same session?

SMB Query. Since Internet Explorer shares some underlying
components with Windows Explorer, it understands
SMB naming conventions such as \\host\mydir\myfile
and allows the user to browse files and
directories. This feature has been used before to steal
user data [16]. Here we point out that SMB can also be
used to undo some of the benefits of private browsing
mode. Consider the following code:


<img src="\\[WEB SERVER IP]\image.jpg">


When IE renders this tag, it initiates an SMB request to 
the web server whose IP is specified in the image source. 
Part of the SMB request is an NTLM authentication that 
works as follows: first an anonymous connection is tried 
and if it fails IE starts a challenge-response negotiation. 
IE also sends the Windows username, Windows
domain name, and Windows computer name to the server, even when the
browser is in InPrivate mode. Even if the user is behind a
proxy, clears the browser state, and uses InPrivate, SMB 
connections identify the user to the remote site. While 
experimenting with this we found that many ISPs filter 
the SMB port 445 which makes this attack difficult in 
practice. 


4 Usage measurement 


We conducted an experiment to determine how the 
choice of browser and the type of site being browsed af- 
fects whether users enable private browsing mode. We 
used advertisement networks as a delivery mechanism 
for our measurement code, using the same ad network 
and technique previously demonstrated in [10, 4]. 


Design. We ran two simultaneous one-day campaigns: 
a campaign that targeted adult sites, and a campaign 
that targeted gift shopping sites. We also ran a campaign
on news sites as a control. We spent $120 to purchase
155,216 impressions, split as evenly as possible between
the campaigns. Our advertisement detected private
browsing mode by visiting a unique URL in an
<iframe> and using JavaScript to check whether a link 
to that URL was displayed as purple (visited) or blue (un- 
visited). The technique used to read the link color varies 
by browser; on Firefox, we used the following code: 


if (getComputedStyle(link).color ==
    "rgb(51,102,160)") {
  // Link is purple, private browsing is OFF
} else {
  // Link is blue, private browsing is ON
}


To see why this browser history sniffing technique [11]
reveals private browsing status, recall that in private
mode browsers do not add entries to the history
database. Consequently, they will color the unique URL 
link as unvisited. However, in public mode the unique 
URL will be added to the history database and the 
browser will render the link as visited. Thus, by reading 
the link color we learn the browser’s privacy state. We 
developed a demonstration of this technique in February 
2009 [9]. To the best of our knowledge, we are the first
to demonstrate this technique to detect private browsing 
mode in all major browsers. 


While this method correctly detects all browsers in private
browsing, it can slightly overcount due to false positives.
For example, some people may disable the history
feature in their browser altogether, which will incorrectly
make us think they are in private mode. In Firefox,
users can disable the :visited pseudo-class using a Firefox
preference intended as a defense against history sniffing.
Again, this will make us think they are in private mode.
We excluded beta versions of Firefox 3.7 and Chrome 6 
from our experiment, since these browsers have experi- 
mental visited link defenses that prevent our automated 
experiment from working. However, we note that these 
defenses are not sufficient to prevent web attackers from 
detecting private browsing, since they are not designed to 
be robust against attacks that involve user interaction [3]. 
We also note that the experiment only measures the pres- 
ence of private mode, not the intent of private mode— 
some users may be in private mode without realizing it. 


Results. The results of our ad network experiment are 
shown in Figure 2. We found that private browsing was 
more popular at adult web sites than at gift shopping sites 
and news sites, which shared a roughly equal level of pri- 
vate browsing use. This observation suggests that some 
browser vendors may be mischaracterizing the primary 
use of the feature when they describe it as a tool for buy- 
ing surprise gifts [8, 17]. 


We also found that private browsing was more com- 
monly used in browsers that displayed subtle private 
browsing indicators. Safari and Firefox have subtle in- 
dicators and enforce a single mode across all windows; 
they had the highest rate of private browsing use. Google 
Chrome and Internet Explorer give users a separate win- 
dow for private browsing, and have more obvious private 
browsing indicators; these browsers had lower rates of 
private browsing use. These observations suggest that 
users may remain in private browsing mode for longer if
they are not reminded of its existence by a separate win- 
dow with obvious indicators. 


Ethics. The experimental design complied with the 
terms of service of the advertisement network. The 
servers logged only information that is typically logged 
by advertisers when their advertisements are displayed. 
We also chose not to log the client’s IP address. We 
discussed the experiment with the institutional review 
boards at our respective institutions and were instructed 
that a formal IRB review was not required because the 
advertisement did not interact or intervene with individ- 
uals or obtain identifiable private information. 
Figure 2: Observed rates of private browsing use


5 Weaknesses in current implementations: 
a systematic study 


Given the complexity of modern browsers, a systematic 
method is needed for testing that private browsing modes 
adequately defend against the threat models of Section 2. 
During our blackbox testing in Section 3.1 it became 
clear that we need a more comprehensive way to en- 
sure that all browser features behave correctly in private 
mode. We performed two systematic studies: 


• Our first study is based on a manual review of the
Firefox source code. We located all points in the 
code where Firefox writes to persistent storage and 
manually verified that those writes are disabled in 
private browsing mode. 


• Our second study is an automated tool that runs
the Firefox unit tests in private browsing mode and 
looks for changes in persistent storage. This tool 
can be used as a regression test to ensure that new 
browser features are consistent with private brows- 
ing. 


We report our results in the next two sections. 


5.1 A systematic study by manual code review


Firefox keeps all the state related to the user’s brows- 
ing activity including preferences, history, cookies, text 
entered in form fields, search queries, etc. in a Profile
folder on disk [22]. By observing how and when persis- 
tent modifications to these files occur in private mode we 
can learn a great deal about how private mode is imple- 
mented in Firefox. In this section we describe the results 
of our manual code review of all points in the Firefox 
code that modify files in the Profile folder. 

Our first step was to identify those files in the profile 
folder that contain information about a private browsing 
session. Then, we located the modules in the Mozilla 
code base that directly or indirectly modify these files. 
Finally, we reviewed these modules to see if they write
to disk while in private mode. 

Our task was greatly simplified by the fact that all 
writes to files inside the Profile directory are done us- 
ing two code abstractions. The first is nsIFile, a 
cross-platform representation of a location in the filesys- 
tem used to read or write to files [21]. The sec- 
ond is Storage, a SQLite database API that can be 
used by other Firefox components and extensions to 
manipulate SQLite database files [23]. Points in the 
code that call these abstractions can check the current 
private browsing state by calling or hooking into the 
nsIPrivateBrowsingService interface [24].
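
The kind of check the review looked for can be sketched as a guard around each write. The sketch below is ours and only illustrative: `makeGuardedStore` and `persist` are hypothetical names, and `pbService` stands in for the service object so the code can run outside Firefox. We assume the Firefox 3.5/3.6 interface, where the service exposes a `privateBrowsingEnabled` attribute.

```javascript
// Sketch of a write guard. In Firefox 3.6 chrome code, `pbService` would be
// obtained (assumption, per the interface cited above) via
//   Components.classes["@mozilla.org/privatebrowsing;1"]
//     .getService(Components.interfaces.nsIPrivateBrowsingService)
// `makeGuardedStore` and `persist` are hypothetical names for illustration.
function makeGuardedStore(pbService, persist) {
  const sessionOnly = new Map(); // discarded when private browsing ends
  return {
    set(key, value) {
      if (pbService.privateBrowsingEnabled) {
        sessionOnly.set(key, value); // keep in memory only
      } else {
        persist(key, value); // e.g. a write through nsIFile or Storage
      }
    },
    get(key) {
      return sessionOnly.get(key);
    },
    exitPrivateBrowsing() {
      sessionOnly.clear(); // nothing from the private session survives
    },
  };
}
```

A write site that calls `persist` directly, without consulting the service, is the pattern flagged below as lacking a private mode check.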

Using this method we located 24 points in the Firefox 
3.6 code base that control all writes to sensitive files in 
the Profile folder. Most had adequate checks for private 
browsing mode, but some did not. We give a few exam- 
ples of points in the code that do not adequately check 
private browsing state. 


• Security certificate settings (stored in file
cert8.db): stores all security certificate set- 
tings and any SSL certificates that have been 
imported into Firefox either by an authorized 
website or manually by the user. This includes SSL 
client certificates. 


There are no checks for private mode in the code. 
We explained in Section 3.1 that this is a violation 
of the private browsing security model since a lo- 
cal attacker can easily determine if the user visited a 
site that generates a client key pair or installs a client 
certificate in the browser. We also note that certifi- 
cates created outside private mode are usable in pri- 
vate mode, enabling a web attacker to link the user 
in public mode to the same user in private mode. 


• Site-specific preferences (stored in file
permissions.sqlite): stores many of the
Firefox permissions that are decided on a per-site
basis. For example, it stores which sites are
allowed or blocked from setting cookies, installing
extensions, showing images, displaying popups,
etc.


While there are checks for private mode in the 
code, not all state changes are blocked. Permissions 
added to block cookies, popups or allow add-ons in 
private mode are persisted to disk. Consequently, if 
a user visits some site that attempts to open a popup, 
the popup blocker in Firefox blocks it and displays 
a message with some actions that can be taken. In 
private mode, the “Edit popup blocker preferences” 
option is enabled and users who click on that option 
can easily add a permanent exception for the site 
without realizing that it would leave a trace of their 
private browsing session on disk. When browsing 
privately to a site that uses popups, users might be 
tempted to add the exception, thus leaking informa- 
tion to the local attacker. 


• Download actions (stored in file
mimeTypes.rdf): the file stores the user’s 
preferences with respect to what Firefox does when 
it comes across known file types like pdf or avi. It 
also stores information about which protocol han- 
dlers (desktop-based or custom protocol handlers) 
to launch when it encounters a non-http protocol 
like mailto [26]. 


There are no checks for private mode in the code. 
As a result, a webpage can install a custom proto- 
col handler into the browser (with the user’s permis- 
sion) and this information would be persisted to disk 
even in private mode. As explained in Section 3.1, 
this enables a local attacker to learn that the user 
visited the website that installed the custom proto- 
col handler in private mode. 


5.2 An automated private browsing test using unit tests


All major browsers have a collection of unit tests for 
testing browser features before a release. We automate 
the testing of private browsing mode by leveraging these 
tests to trigger many browser features that can potentially 
violate private browsing. We explain our approach as it 
applies to the Firefox browser. We use MozMill, a Fire- 
fox user-interface test automation tool [20]. Mozilla pro- 
vides about 196 MozMill tests for the Firefox browser. 


Our approach. We start by creating a fresh browser 
profile and set preferences to always start Firefox in pri- 
vate browsing mode. Next we create a backup copy of 
the profile folder and start the MozMill tests. We use 
two methods to monitor which files are modified by the 
browser during the tests: 

• fs_usage is a Mac OSX utility that presents system
calls pertaining to filesystem activity. It outputs
the name of the system call used to access the
filesystem and the file descriptor being acted upon.
We built a wrapper script around this tool to map
the file descriptors to actual pathnames using lsof.
We run our script in parallel with the browser and 
the script monitors all files that the browser writes 
to. 


• We also use the “last modified time” for files in
the profile directory to identify those files that are
changed during the test.


Once the MozMill test completes we compare the modi- 
fied profile files with their backup versions and examine 
the exact changes to eliminate false positives. In our ex- 
periments we took care to exclude all MozMill tests like 
“testPrivateBrowsing” that can turn off private browsing 
mode. This ensured that the browser was in private mode 
throughout the duration of the tests. 
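
The backup-and-compare step can be sketched as below. This is a minimal Node.js sketch under our own assumptions, not the authors' script: `snapshot` and `diffSnapshots` are hypothetical names, and a SHA-256 content hash is used in place of comparing files against a backup copy. Separating "touched" from "changed" files mirrors the manual step of examining the exact changes to eliminate false positives.

```javascript
const fs = require("fs");
const path = require("path");
const crypto = require("crypto");

// Record modification time and a content hash for every file under `dir`.
function snapshot(dir, base = dir, out = {}) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      snapshot(full, base, out);
    } else if (entry.isFile()) {
      out[path.relative(base, full)] = {
        mtimeMs: fs.statSync(full).mtimeMs,
        sha256: crypto.createHash("sha256").update(fs.readFileSync(full)).digest("hex"),
      };
    }
  }
  return out;
}

// Classify files as added, changed (content differs), or touched
// (modification time differs but content is identical).
function diffSnapshots(before, after) {
  const report = { added: [], changed: [], touched: [] };
  for (const [name, now] of Object.entries(after)) {
    const old = before[name];
    if (!old) report.added.push(name);
    else if (old.sha256 !== now.sha256) report.changed.push(name);
    else if (old.mtimeMs !== now.mtimeMs) report.touched.push(name);
  }
  return report;
}

// Usage (profile path is an assumption):
//   const before = snapshot("/path/to/profile");
//   ... run the tests in private browsing mode ...
//   console.log(diffSnapshots(before, snapshot("/path/to/profile")));
```

Deleted files are not reported by this sketch; a full tool would also list names present in `before` but missing from `after`.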

We did the above experiment on Mac OSX 10.6.2 and 
Windows Vista running Firefox 3.6. Since we only con- 
sider the state of browser profile and start with a clean 
profile, the results should not depend on OS or state of 
the machine at the time of running the tests. 


Results. After running the MozMill tests we discovered 
several additional browser features that leak information 
about private mode. We give a few examples. 


• Certificate Authority (CA) Certificates (stored in
cert8.db). Whenever the browser receives a certificate
chain from the server, it stores all the certificate
authorities in the chain in cert8.db. Our
tests revealed that CA certs cached in private mode
persist when private mode ends. This is a significant
privacy violation. Whenever the user visits a site
that uses a non-standard CA, such as certain government
sites, the browser will cache the corresponding
CA cert and expose this information to the local attacker.


• SQLite databases. The tests showed that the last
modified timestamps of many SQLite databases in
the profile folder are updated during the test. But at
the end of the tests, the resulting files have exactly
the same size and there are no updates to any of the
tables. Nevertheless, this behavior can be exploited by
a local attacker to discover that private mode was
turned on in the last browsing session. The attacker
simply observes that no entries were added to the
history database, but the SQLite databases were accessed.


• Search Plugins (stored in search.sqlite and
search.json). Firefox supports auto-discovery
of search plugins [19, 25] which is a way for web
sites to advertise their Firefox search plugins to the 
user. The tests showed that a search plugin added in 
private mode persists to disk. Consequently, a local 
attacker will discover that the user visited the web 
site that provided the search plugin. 


• Plugin Registration (stored in pluginreg.dat).
This file is generated automatically and records in- 
formation about installed plugins like Flash and 
Quicktime. We observed changes in modification 
time, but there were only cosmetic changes in the 
file content. However, as with search plugins, new 
plugins installed in private mode result in new in- 
formation written to pluginreg.dat. 


Discovering these leaks using MozMill tests is much easier
than a manual code review.


Using our approach as a regression tool. Using exist- 
ing unit tests provides a quick and easy way to test private 
browsing behavior. However, it would be better to in- 
clude testcases that are designed specifically for private 
mode and cover all browser components that could po- 
tentially write to disk. The same suite of testcases could 
be used to test all browsers and hence would bring some 
consistency in the behavior of various browsers in private 
mode. 

As a proof of concept, we wrote two MozMill testcases 
for the violations discovered in Section 5.1: 


• Site-specific Preferences (stored in file
permissions.sqlite): visits a fixed URL
that opens a popup. The test edits preferences to
allow a popup from this site.

• Download Actions (mimeTypes.rdf): visits a
fixed URL that installs a custom protocol handler.


Running these tests using our testing script revealed 
writes to both profile files involved. 


6 Browser addons 


Browser addons (extensions and plug-ins) pose a privacy 
risk to private browsing because they can persist state to 
disk about a user’s behavior in private mode. The devel- 
opers of these addons may not have considered private 
browsing mode while designing their software, and their 
source code is not subject to the same rigorous scrutiny 
that browsers are subjected to. Each of the different 
browsers we surveyed had a different approach to addons 
in private browsing mode: 


• Internet Explorer has a configurable “Disable
Toolbars and Extensions when InPrivate Browsing
Mode Starts” menu option, which is checked by default.
When checked, extensions (browser helper
objects) are disabled, although plugins (ActiveX
controls) are still functional.


• Firefox allows extensions and plugins to function
normally in Private Browsing mode. 


• Google Chrome disables most extension functionality
in Incognito mode. However, plugins (including
plugins that are bundled with extensions) are enabled.
Users can add exceptions on a per-extension
basis using the extensions management interface.


• Safari does not have a supported extension API.
Using unsupported APIs, it is possible for exten- 
sions to run in private browsing mode. 


In Section 6.1, we discuss problems that can occur in 
browsers that allow extensions in private browsing mode. 
In Section 6.2 we discuss approaches to address these 
problems, and we implement a mitigation in Section 6.3. 


6.1 Extensions violating private browsing 


We conducted a survey of extensions to find out if they 
violated private browsing mode. This section describes 
our findings. 


Firefox. We surveyed the top 40 most popular add-ons 
listed at http://addons.mozilla.org. Some of 
these extensions like “Cooliris” contain binary compo- 
nents (native code). Since these binary components exe- 
cute with the same permissions as those of the user, the 
extensions can, in principle, read or write to any file on 
disk. This arbitrary behavior makes the extensions dif- 
ficult to analyze for private mode violations. We regard 
all binary extensions as unsafe for private browsing and 
focus our attention only on JavaScript-only extensions. 

To analyze the behavior of JavaScript-only extensions, 
we observed all persistent writes they caused when the 
browser is running in private mode. Specifically, for each 
extension, we install that extension and remove all other 
extensions. Then, we run the browser for some time, do 
some activity like visiting websites and modifying ex- 
tension options so as to exercise as many features of the 
extension as possible and track all writes that happen dur- 
ing this browsing session. A manual scan of the files and 
data that were written then tells us if the extension vio- 
lated private mode. If we find any violations, the exten- 
sion is unsafe for private browsing. Otherwise, it may or 
may not be safe. 

Tracking all writes caused by extensions is easy, as almost
all JavaScript-only extensions rely on one of the
following three abstractions to persist data on disk:


• nsIFile is a cross-platform representation of
a location in the filesystem. It can be used
to create or remove files/directories and write
data when used in combination with components
such as nsIFileOutputStream and
nsISafeOutputStream.


• Storage is a SQLite database API [23]
and can be used to create, remove, open or
add new entries to SQLite databases using
components like mozIStorageService,
mozIStorageStatement and
mozIStorageConnection.


• Preferences can be used to store preferences
containing key-value (boolean, string or integer)
pairs using components like nsIPrefService,
nsIPrefBranch and nsIPrefBranch2.


We instrumented Firefox (version 3.6 alpha1 pre, codenamed
Minefield) by adding log statements in all functions
in the above Mozilla components that could write
data to disk. This survey was done on a Windows Vista 
machine. 

Out of the 32 JavaScript-only extensions, we did not 
find any violations for 16 extensions. Some of these ex- 
tensions like “Google Shortcuts” did not write any data 
at all and some others like “Firebug” only wrote boolean 
preferences. Other extensions like “1-Click YouTube 
Video Download” only write files that users want to 
download whereas “FastestFox” writes bookmarks made 
by the user. Notably, only one extension (“Tab Mix
Plus”) checks for private browsing mode and disables the
UI option to save the session if it is detected.

For 16 extensions, we observed writes to disk that can 
allow an attacker to learn about private browsing activity. 
We provide three categories of the most common viola- 
tions below: 


• URL whitelist/blocklist/queues. Many extensions
maintain a list of special URLs that are always excluded
from processing. For instance, the “NoScript”
extension blocks all scripts running on visited webpages.
Users can add sites to a whitelist for which
it should allow all scripts to function normally.
Such exceptions added in private mode are persisted
to disk. Also, downloaders like “DownThemAll”
maintain a queue of URLs to download from. This
queue is persisted to disk even in private mode and
is not cleared until the download completes.


• URL Mappings. Some extensions allow specific
features or processing to be enabled for specific
websites. For instance, “Stylish” allows different
CSS styles to be used for rendering pages from different
domains. The mapping of which style to use
for which website is persisted to disk even in private
mode.


• Timestamp. Some extensions store a timestamp indicating
the last use of some feature or resource. For
instance, “Personas” are easy-to-use themes that let
the user personalize the look of the browser. The
extension stores a timestamp indicating the last time
the theme was changed. This could potentially be used
by an attacker to learn that private mode was turned
on by comparing this timestamp with the last timestamp
when a new entry was added to the browser
history.


It is also interesting to note that the majority of the ex- 
tensions use Preferences or nsIFile to store their 
data and very few use the SQLite database. Out of the 
32 JavaScript-only extensions, only two use the SQLite 
database whereas the rest of them use the former. 


Google Chrome. Google launched an extension plat- 
form for Google Chrome [5] at the end of January 2010. 
We have begun a preliminary analysis of the most popu- 
lar extensions that have been submitted to the official ex- 
tensions gallery. Of the top 100 extensions, we observed 
that 71 stored data to disk using the localStorage 
API. We also observed that 5 included plugins that can 
run arbitrary native code, and 4 used Google Analytics to 
store information about user behavior on a remote server. 
The significant use of local storage by these extensions 
suggests that they may pose a risk to Incognito. 


6.2 Running extensions in private browsing

Current browsers force the user to choose between run- 
ning extensions in private browsing mode or blocking 
them. Because not all extensions respect private brows- 
ing mode equally, these policies will either lead to pri- 
vacy problems or block extensions unnecessarily. We 
recommend that browser vendors provide APIs that en- 
able extension authors to decide which state should be 
persisted during private browsing and which state should 
be cleared. There are several reasonable approaches that 
achieve this goal: 


• Manual check. Extensions that opt-in to running in
private browsing mode can detect the current mode
and decide whether or not to persist state.


• Disallow writes. Prevent extensions from changing
any local state while in private browsing mode.


• Override option. Discard changes made by extensions
to local state while in private browsing
mode, unless the extension explicitly indicates that
the write should persist beyond private browsing
mode.


Several of these approaches have been under discussion
on the Google Chrome developers mailing list [28].
We describe our implementation of the first variant in 
Section 6.3. We leave the implementation of the latter 
variants for future work. 


6.3 Extension blocking tool


To implement the policy of blocking extensions from 
running in private mode as described in Section 6.2,
we built a Firefox extension called ExtensionBlocker 
in 371 lines of JavaScript. Its basic functionality 
is to disable all extensions that are not safe for pri- 
vate mode. So, all unsafe extensions will be disabled 
when the user enters private mode and then re-enabled 
when the user leaves private mode. An extension is 
considered safe for private mode if its manifest file 
(install.rdf for Firefox extensions) contains a new 
XML tag <privateModeCompatible/>. Table 4 
shows a portion of the manifest file of ExtensionBlocker 
declaring that it is safe for private browsing. 

ExtensionBlocker subscribes to the 
nsIPrivateBrowsingService to observe 
transitions into and out of private mode. Whenever 
private mode is enabled, it looks at each enabled 
extension in turn, checks its manifest file for the
<privateModeCompatible/> tag and disables 
the extension if no tag is found. Also, it saves the list 
of extensions that were enabled before going to private 
mode. Lastly, when the user switches out of private 
mode, it re-enables all extensions in this saved list. At 
this point, it also cleans up the saved list and any other 
state to make sure that we do not leave any traces behind. 
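As a rough illustration of the enable/disable cycle described above, the following Python sketch models the bookkeeping. It is not ExtensionBlocker's actual JavaScript: the `Extension` record, the `ExtensionBlocker` class and its method names are hypothetical, the browser restart is omitted, and the `em` namespace URI is assumed to be the one conventionally used in install.rdf manifests.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Assumed namespace bound to the "em" prefix in install.rdf manifests.
EM_NS = "http://www.mozilla.org/2004/em-rdf#"


def is_private_mode_compatible(manifest_xml: str) -> bool:
    """Return True if the manifest contains an <em:privateModeCompatible/> tag."""
    root = ET.fromstring(manifest_xml)
    return root.find(f".//{{{EM_NS}}}privateModeCompatible") is not None


@dataclass
class Extension:
    """Hypothetical stand-in for an installed extension."""
    name: str
    manifest_xml: str
    enabled: bool = True


class ExtensionBlocker:
    """Hypothetical model of the blocking policy (no browser restart)."""

    def __init__(self, extensions):
        self.extensions = extensions
        self.saved = []  # names of extensions enabled before private mode

    def enter_private_mode(self):
        # Remember what was enabled, then disable everything without the tag.
        self.saved = [e.name for e in self.extensions if e.enabled]
        for ext in self.extensions:
            if ext.enabled and not is_private_mode_compatible(ext.manifest_xml):
                ext.enabled = False

    def exit_private_mode(self):
        # Re-enable everything in the saved list, then clear the saved state.
        for ext in self.extensions:
            if ext.name in self.saved:
                ext.enabled = True
        self.saved = []
```

An extension that was already disabled before private mode is not in the saved list, so it stays disabled afterwards.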

One implementation detail to note here is that we need 
to restart Firefox to make sure that appropriate exten- 
sions are completely enabled or disabled. This means 
that the browser would be restarted at every entry into or 
exit from private mode. However, the public browsing 
session will still be restored after coming out of private 
mode. 


7 Related work 


Web attacker. Most work on private browsing focuses 
on security against a web attacker who controls a num- 
ber of web sites and is trying to determine the user’s 
browsing behavior at those sites. Torbutton [29] and Fox- 
Tor [31] are two Firefox extensions designed to make it 
harder for web sites to link users across sessions. Both 
rely on the Tor network for hiding the client’s IP address 
from the web site. PWS [32] is a related Firefox exten- 
sion designed for search query privacy, namely prevent- 
ing a search engine from linking a sequence of queries to 
a specific user.

Earlier work on private browsing such as [34] focused 
primarily on hiding the client’s IP address. Browser fin- 
gerprinting techniques [1, 14, 6] showed that additional 
steps are needed to prevent linking at the web site.
Torbutton [29] is designed to mitigate these attacks by
blocking various browser features used for fingerprinting the
browser. 

Other work on privacy against a web attacker includes 
Janus [7], Doppelganger [33] and Bugnosis [2]. Janus 
is an anonymity proxy that also provides the user with 
anonymous credentials for logging into sites. Doppel- 
ganger [33] is a client-side tool that focuses on cookie 
privacy. The tool dynamically decides which cookies 
are needed for functionality and blocks all other cook- 
ies. Bugnosis [2] is a Firefox extension that warns users 
about server-side tracking using web bugs. Millet et al. 
carry out a study of cookie policies in browsers [18]. 

P3P is a language for web sites to specify privacy poli- 
cies. Some browsers let users configure the type of sites 
they are willing to interact with. While much work went 
into improving P3P semantics [13, 27, 30], the P3P
mechanism has not received widespread adoption.


Local attacker. In recent years, computer forensics experts
have developed an array of tools designed to process the
browser’s cache and history file in an attempt to learn 
what sites a user visited before the machine was con- 
fiscated [12]. Web historian, for example, will crawl 
browser activity files and report on all recent activity 
done using the browser. The tool supports all major 
browsers. The Forensic Tool Kit (FTK) has similar func- 
tionality and an elegant user interface for exploring the 
user’s browsing history. A well designed private brows- 
ing mode should successfully hide the user’s activity 
from these tools. 

In an early analysis of private browsing modes, 
McKinley [15] points out that the Flash Player and 
Google Gears browser plugins violate private browsing 
modes. Flash Player has since been updated to be consistent
with the browser’s privacy mode. More generally,
NPAPI, the plugin API, was extended to allow plugins 
to query the browser’s private browsing settings so that 
plugins can modify their behavior when private brows- 
ing is turned on. We showed that the problem is more 
complex for browser extensions and proposed ways to 
identify and block problematic extensions. 


<em:targetApplication>
  <Description>
    <em:id>{ec8030f7-c20a-464f-9b0e-13a3a9e97384}</em:id>
    <em:minVersion>1.5</em:minVersion>
    <em:maxVersion>3.*</em:maxVersion>
    <em:privateModeCompatible/>
  </Description>
</em:targetApplication>


Table 4: A portion of the manifest file of ExtensionBlocker


8 Conclusions


We analyzed private browsing modes in modern
browsers and discussed their success at achieving the desired
security goals. Our manual review and automated
testing tool pointed out several weaknesses in existing
implementations. The most severe violations enable a
local attacker to completely defeat the benefits of private
mode. In addition, we performed the first measurement
study of private browsing usage in different browsers and
on different sites. Finally, we examined the difficult issues
of keeping browser extensions and plug-ins from
undoing the goals of private browsing.


Future work. Our results suggest that current private 
browsing implementations provide privacy against some 
local and web attackers, but can be defeated by deter- 
mined attackers. Further research is needed to design 
stronger privacy guarantees without degrading the user 
experience. For example, we ignored privacy leakage 
through volatile memory. Is there a better browser ar- 
chitecture that can detect all relevant private data, both 
in memory and on disk, and erase it upon leaving pri- 
vate mode? Moreover, the impact of browser extensions 
and plug-ins on private browsing raises interesting open 
problems. How do we prevent uncooperative and legacy 
browser extensions from violating privacy? In browsers 
like IE and Chrome that permit public and private win- 
dows to exist in parallel, how do we ensure that exten- 
sions will not accidentally transfer data from one window 
to the other? We hope this paper will motivate further re- 
search on these topics. 
Abstract


A key feature that distinguishes modern botnets from 
earlier counterparts is their increasing use of structured 
overlay topologies. This lets them carry out sophisticated 
coordinated activities while being resilient to churn, but 
it can also be used as a point of detection. In this 
work, we devise techniques to localize botnet mem- 
bers based on the unique communication patterns aris- 
ing from their overlay topologies used for command and 
control. Experimental results on synthetic topologies 
embedded within Internet traffic traces from an ISP’s 
backbone network indicate that our techniques (i) can
localize the majority of bots with a low false positive rate,
and (ii) are resilient to incomplete visibility arising from
partial deployment of monitoring systems and measure- 
ment inaccuracies from dynamics of background traffic. 


1 Introduction 


Malware is an extremely serious threat to modern net- 
works. In recent years, a new form of general-purpose 
malware known as bots has arisen. Bots are unique in 
that they collectively maintain communication structures 
across nodes to resiliently distribute commands from a 
command and control (C&C) node. The ability to coordinate
and upload new commands to bots gives the botnet
owner vast power when performing criminal activities,
including the ability to orchestrate surveillance attacks,
perform DDoS extortion, send spam for pay,
and conduct phishing. This problem has worsened to a point
where modern botnets control hundreds of thousands of 
hosts and generate revenues of millions of dollars per 
year for their owners [23, 42]. 

Early botnets followed a centralized architecture. 
However, the growing size of botnets, as well as the
development of mechanisms that detect centralized
command-and-control servers [10, 44, 27, 31, 72, 9, 49, 30, 29, 76],
has motivated the design of decentralized peer-to-peer
botnets. Several recently discovered botnets, such as
Storm, Peacomm, and Conficker, have adopted the use 
of structured overlay networks [71, 57, 58]. These net- 
works are a product of research into efficient communi- 
cation structures and offer a number of benefits. Their 
lack of centralization means a botnet herder can join
and control at any place, making it easier to evade
discovery. The topologies themselves provide low-delay
any-to-any communication and low control overhead
to maintain the structure. Further, structured overlay 
mechanisms are designed to remain robust in the face of 
churn [48, 32], an important concern for botnets, where 
individual machines may be frequently disinfected or 
simply turned off for the night. Finally, structured over- 
lay networks also have protection mechanisms against 
active attacks [12]. 


In this work, we examine the question of whether ISPs 
can detect these efficient communication structures of 
peer-to-peer (P2P) botnets and use this as a basis for bot- 
net defense. ISPs, enterprise networks, and IDSs have 
significant visibility into these communication patterns 
due to the potentially large number of paths between 
bots that traverse their routers. Yet the challenge is sep- 
arating botnet traffic from background Internet traffic, as 
each botnet node combines command-and-control com- 
munication with the regular connections made by the ma- 
chine’s user. In addition, the massive scale of the com- 
munications makes it challenging to perform this task ef- 
ficiently. 


We propose BotGrep, an algorithm that isolates effi- 
cient peer-to-peer communication structures solely based 
on the information about which pairs of nodes commu- 
nicate with one another (communication graph). Our 
approach relies on the fast-mixing nature of the struc- 
tured P2P botnet C&C graph [26, 11, 6, 79]. The Bot- 
Grep algorithm iteratively partitions the communication 
graph into a faster-mixing and a slower-mixing piece, 
eventually narrowing on to the fast-mixing component. 
Although graph analysis has been applied to botnet and 
P2P detection [15, 36, 78, 35], our approach exploits the
spatial relationships in communication traffic to a sig- 
nificantly larger extent than these works. Based on ex- 
perimental results, we find that under typical workloads 
and topologies our techniques localize 93-99% of botnet- 
infected hosts with a false positive probability of less 
than 0.6%, even when only a partial view of the commu- 
nication graph is available. We also develop algorithms 
to run BotGrep in a privacy-preserving fashion, such that 
each ISP keeps its share of the communication graph pri- 
vate, and show that it can still be executed with access to 
a moderate amount of computing resources. 

The BotGrep algorithm is content agnostic, thus it is 
not affected by the choice of ports, encryption, or other 
content-based stealth techniques used by bots. However, 
BotGrep must be paired with some sort of malware de- 
tection scheme, such as anomaly or misuse detection, 
to be able to distinguish botnet control structures from 
other applications using peer-to-peer communication. A 
promising approach starts with a honeynet that “traps” a 
number of bots. BotGrep is then able to take this small 
seed of bot nodes and recover the rest of the botnet com- 
munication structure and nodes. 


Roadmap: We start by giving a more detailed prob- 
lem description in Section 2. In Section 3, we describe 
our overall approach and core algorithms, and describe 
privacy-preserving extensions that enable sharing of ob- 
servations across ISP boundaries in Section 4. We then 
evaluate performance of our algorithms on synthetic bot- 
net topologies embedded in real Internet traffic traces in 
Section 5. We provide a brief discussion of remaining 
challenges in Section 6, and describe related work in Sec- 
tion 7. Finally, we conclude in Section 8. 


2 System Architecture 


In this section we describe several challenges involved in 
detecting botnets. We then describe our overall architec- 
ture and system design. 


Challenges: Over the recent years, botnets have been 
adapting in order to evade detection and their activities 
have become increasingly stealthy. Botnets use random
ports and encrypt their communication contents, thus
defeating content-based identification. Traffic patterns, which
have previously been used for detection [29], could po- 
tentially be altered as well, using content padding or 
other approaches. However, overall, it seems hard to hide 
the fact that two nodes are communicating, and thus we 
use this information as the basis for our design. 
We are also faced with several additional challenges.
The background traffic on the Internet is highly
variable and continuously changing, and likely dwarfs 
the small amount of control traffic exchanged between 
botnet hosts. Further, botnet nodes combine their malicious
activity with the regular traffic of the legitimate
users, thus they are deeply embedded inside the back- 
ground communication topology. For example, Fig- 
ure 1(b) shows a visualization of a synthetic P2P bot- 
net graph embedded within a communication graph col- 
lected from the Abilene Internet2 ISP. The botnet is 
tightly integrated and cannot be separated from the rest 
of the nodes by a small cut. 

In order to observe a significant fraction of botnet 
C&C traffic, it is necessary to combine observations from 
many vantage points across multiple ISPs. This creates 
an extremely large volume of data, since originally the 
background traffic will be captured as well. Thus, any 
analysis algorithms face a significant scaling challenge. 
In addition, although ISPs have already demonstrated 
their willingness to detect misbehavior in order to better 
serve their customers [3] as well as cooperating across 
administrative boundaries [4], they may be reluctant to 
share traffic observations, as those may reveal confiden- 
tial information about their business operations or their 
customers. 

We next propose a botnet defense architecture that ad- 
dresses these challenges. 


System architecture: As a first step, our approach
requires collecting a communication graph, where the 
nodes represent Internet hosts and edges represent com- 
munication (of any sort) between them. Portions of this 
graph are already being collected by various ISPs: the 
need to perform efficient accounting, traffic engineer- 
ing and load balancing, detection of malicious and dis- 
allowed activity, and other factors, have already led net- 
work operators to deploy infrastructure to monitor traffic 
across multiple vantage points in their networks. Bot- 
Grep operates on a graph that is obtained by combin- 
ing observations across these points into a single graph, 
which offers significant, though incomplete, visibility
into the overall communication of Internet hosts.¹ Traffic
monitoring itself has been studied in previous work
(e.g., [44]), and hence our focus in this work is not on 
architectural issues but rather on building scalable botnet 
detection algorithms to operate on such an infrastructure. 

A second source of input is misuse detection. Since 
botnets use communication structures similar to other 
P2P networks, the communication graph alone may not 


¹Tools such as Cisco IOS’s NetFlow [2] are designed to sample
traffic by only processing one out of every 500 packets (by default). 
To evaluate the effect of sampling, we replayed packet-level traces col- 
lected by the authors of [42] from Storm botnet nodes, and simulated 
NetFlow to determine the fraction of botnet links that would be de- 
tected. We found that in the worst case (assuming each flow traversed a 
different router), after 50 minutes, 100% of botnet links were detected. 
Moreover, recent advances in counter architectures [77] may enable 
efficient tracking of the entire communication graph without need for 
sampling. 
Figure 1: (a) BotGrep architecture and (b) Abilene network with embedded P2P subgraph


be enough to distinguish the two. Some form of indica- 
tion of malicious activity, such as botnet nodes trapped in 
Honeynets [68] or scanning behavior detected by Dark- 
nets [7], is therefore necessary. A list of misbehaving 
hosts can act as an initial “seed” to speed up botnet iden- 
tification, or it can be used later to verify that the detected 
network is indeed malicious. 

The next step is to isolate a botnet communication sub- 
graph. Recently, botnet creators have been turning to 
communication graphs provided by structured networks, 
both due to their advantages in terms of efficiency and 
resilience, and due to easy availability of well-tested 
implementations of the structured P2P algorithms (e.g., 
Storm bases the C&C structure for its supernodes on the 
Overnet implementation of Kademlia [50]). One common
feature of these structured graphs is their fast mixing
time, i.e., the convergence time of random walks to a
stationary distribution. Our algorithm exploits this prop- 
erty by performing random walks to identify fast-mixing 
component(s) and isolate them from the rest of the com- 
munication graph. If sharing of sensitive information 
is an issue, it is possible to perform random walks in a 
privacy-preserving fashion on a graph that is split among 
a collection of ISPs. 

Once the botnet C&C structure is identified and con- 
firmed as malicious, BotGrep outputs a set of suspect 
hosts. This list may be used to install blacklists into 
routers, to configure intrusion detection systems, fire- 
walls, and traffic shapers; or as “hints” to human oper- 
ators regarding which hosts should be investigated. The 
list may also be distributed to subscribers of the service, 
potentially providing a revenue stream. The overall
architecture is shown in Figure 1(a).


3 Inference Algorithm 


Our inference algorithm starts with a communication
graph $G = (V, E)$, with $V$ representing the set of hosts
observed in traffic traces and undirected edges $e \in E$
inserted between communicating hosts. Embedded within
$G$ is a P2P graph $G_p \subset G$, and the remaining subgraph
$G_n = G - G_p$ contains non-P2P communications. The
goal of our algorithms is to reliably partition the input
graph $G$ into $\{G_p, G_n\}$ in the presence of dynamic
background traffic and with only partial visibility.
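To make the input concrete, the sketch below builds an undirected communication graph from a list of observed (source, destination) host pairs. It is only an illustration: the function name `build_communication_graph`, the adjacency-set representation, and the choice to drop self-communication are assumptions, not details taken from the paper.

```python
from collections import defaultdict


def build_communication_graph(flows):
    """Build an undirected graph (node -> set of neighbours) from host pairs.

    flows is an iterable of (src, dst) pairs; direction and repetition are
    ignored, so each communicating pair contributes one undirected edge.
    """
    adj = defaultdict(set)
    for src, dst in flows:
        if src == dst:
            continue  # assumption: a host talking to itself adds no edge
        adj[src].add(dst)
        adj[dst].add(src)
    return dict(adj)
```

The later sketches in this section reuse this node-to-neighbour-set representation.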


3.1 Approach overview 


The main idea behind our approach is that, since most 
P2P topologies are much more highly structured than 
background Internet traffic, we can partition by detect- 
ing subgraphs that exhibit different topological patterns 
from each other or the rest of the graph. We do this 
by performing random walks, and comparing the relative 
mixing rates of the P2P subgraph structure and the rest 
of the communication graph. The subgraph correspond- 
ing to structured P2P traffic is expected to have a faster 
mixing rate than the subgraph corresponding to the rest 
of the network traffic. The challenge of the problem is to 
partition the graph into these two subgraphs when they 
are not separated by a small cut, and to do so efficiently 
for very large graphs. 


Our approach consists of three key steps. Since the 
input graph could contain millions of nodes, we first ap- 
ply a prefiltering step to extract a smaller set of candi- 
date peer-to-peer nodes. This set of nodes contains most 
peer-to-peer nodes, as well as false positives. Next, we 
use a clustering technique based on the SybilInfer algorithm
[21] to cluster only the peer-to-peer nodes, and remove
false positives. The final step involves validating
the result of our algorithms based on fast-mixing charac- 
teristics of peer-to-peer networks. 
3.2 Prefiltering Step


The key idea in the prefiltering step is that for short ran- 
dom walks, the state probability mass associated with 
nodes in the fast-mixing subgraph is likely to be closer to 
the stationary distribution than nodes in the slow-mixing 
subgraph. Let $P$ be the transition matrix of the random
walks. $P$ is defined as

$$P_{ij} = \begin{cases} \dfrac{1}{d_i} & \text{if } i \to j \text{ is an edge in } G \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $d_i$ denotes the degree of vertex $i$ in $G$.
The probability associated with each vertex after the
short random walk of length $t$, denoted by $q^t$, can be
used as a metric to compare vertices and guide the
extraction of the P2P subgraph. The initial probability
distribution $q^0$ is set to $q^0_i = 1/|V|$, which means that the
walk starts at all nodes with equal probability. We
can recursively compute $q^t$ as follows:

$$q^t = q^{t-1} \cdot P \qquad (2)$$
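The recursion in Equation 2 can be sketched in a few lines of Python. This is a minimal illustration that assumes the graph is given as a dictionary mapping each node to the set of its neighbours and that every node has degree at least one; the function name `random_walk_distribution` is hypothetical.

```python
def random_walk_distribution(adj, t):
    """Return q^t for a walk of length t started uniformly over all nodes.

    adj maps each node to the set of its neighbours (undirected graph,
    every node is assumed to have at least one neighbour).
    """
    nodes = list(adj)
    q = {v: 1.0 / len(nodes) for v in nodes}   # q^0_i = 1/|V|
    for _ in range(t):
        nxt = {v: 0.0 for v in nodes}
        for i in nodes:
            share = q[i] / len(adj[i])         # q_i * P_ij with P_ij = 1/d_i
            for j in adj[i]:
                nxt[j] += share
        q = nxt                                # q^t = q^{t-1} . P
    return q
```

Each step touches every edge twice, so the cost grows with $|E| \cdot t$ rather than with $|V|^2$, which is why the sparse representation matters for large graphs.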


Now, since nodes in the fast-mixing subgraph are
likely to have $q^t$ values closer to the stationary
distribution than nodes in the slow-mixing subgraph, and
because the stationary distribution is proportional to node
degrees, we can cluster nodes with homogeneous $q^t_i / d_i$
values. However, before doing so, we apply a transformation
to dampen the negative effects of high-degree
nodes on structured graph detection. High-degree nodes
or hubs are responsible for speeding up the mixing rate
of the non-structured subgraph $G_n$ and can reduce the
relative mixing rate of $G_p$ as compared to $G_n$. The
transformation filter is as follows:

$$s_i = \left( \frac{q^t_i}{d_i} \right)^{1/r}, \qquad (3)$$

where $r$ is the dampening constant. We can now cluster
vertices in the graph by using the k-means algorithm [47]
on the set of values $s$. The k-means clustering algorithm
divides the points in $s$ into $k$ ($k < |V|$) clusters such that
the sum of squares $J$ from points to the assigned cluster
centers is minimized.
$$J = \sum_{j=1}^{k} \sum_{i=1}^{|V|} \left\| s_i^{(j)} - c_j \right\|^2, \qquad (4)$$

where $s_i^{(j)}$ denotes a point assigned to cluster $j$ and
$c_j$ is the center of cluster $j$. The within-cluster sum
of squares for each cluster constitutes the cluster score.
The parameter $k$ is chosen using the method of Pelleg
and Moore [56]. Starting from a user-specified minimum
number of clusters $k = k_{min}$, we repeatedly compute
k-means over our dataset by incrementing $k$ up to a
maximum of $k_{max}$. We then select the best-scoring $k$ value.
$k_{min}$ and $k_{max}$ correspond to the minimum and maximum
number of possible botnets within the dataset. In our
experiments, we used $k_{min} = 0$ and $k_{max} = 20$.
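The dampening transformation and the clustering objective can be sketched as follows. This is a hedged illustration only: it assumes Equation 3 has the form $s_i = (q^t_i / d_i)^{1/r}$, uses plain Lloyd-style k-means on scalars for a fixed $k$, and picks a deterministic initialization for reproducibility. The selection of $k$ by the method of Pelleg and Moore is not implemented, and the names `dampen` and `kmeans_1d` are hypothetical.

```python
def dampen(q, degree, r):
    """Apply s_i = (q_i / d_i) ** (1 / r) to every node (assumed Equation 3)."""
    return {v: (q[v] / degree[v]) ** (1.0 / r) for v in q}


def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalars; returns (centers, labels, J).

    J is the sum of squared distances from each point to its assigned
    cluster center, i.e. the objective of Equation 4. Requires 1 <= k <= len(values).
    """
    pts = sorted(values)
    # Deterministic initialisation (an assumption): spread over the sorted data.
    centers = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]

    def assign(cs):
        return [min(range(k), key=lambda j: (x - cs[j]) ** 2) for x in values]

    for _ in range(iterations):
        labels = assign(centers)
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(centers)
    J = sum((x - centers[lab]) ** 2 for x, lab in zip(values, labels))
    return centers, labels, J
```

Running `kmeans_1d` for each candidate $k$ and comparing the resulting scores would be one simple way to approximate the model-selection loop described above.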

Each of the $k$ clusters corresponds to a set of nodes
in $V_G$, so we may partition our graph into subgraphs
$\{G_1, G_2, \ldots, G_k\}$. We must now confirm or reject the
hypothesis that each of these subgraphs contains a
structured P2P graph. Clustering helps speed up the
super-linear components of the following algorithm; we may
also be able to focus our attention on a particular sub- 
set of clusters if misuse detection is concentrated within 
them. 

Note that we can use the sparse nature of the matrix
$P$ to compute $q^t$ using Equation 2 very efficiently
in $O(|E| \cdot t)$ time. The time and space complexity of
Equation 3 is $O(|V|)$, while Equation 4 can be computed
in $O(k \cdot |V|)$ iterations. Thus the prefiltering step is a
very efficient mechanism to obtain a set of candidate P2P
nodes, capable of operating on large node graphs.





3.3 Clustering P2P Nodes


The subgraphs computed by the above step are likely 
to contain P2P nodes, but they are also likely to con- 
tain some non-P2P nodes due to the “leakage” of random 
walks out of the structured subgraph. We perform a second
pass over each subgraph $G_i \in \{G_1, G_2, \ldots, G_k\}$ to
remove weakly connected nodes.

We cluster P2P nodes by using the SybilInfer [21]
framework. SybilInfer is a technique to detect Sybil
identities in a social network graph; a key feature of
SybilInfer is a sampling strategy to identify a good partition
out of an extremely large space of possibilities ($2^{|V|}$).
However, the detection algorithm used in SybilInfer relies
on the existence of a small cut between the honest
social network and the Sybil subgraph, and is thus not
directly applicable to our setting. Next, we present a
modified SybilInfer algorithm that is able to detect P2P
nodes.

1. Generation of Traces: The first step of the clustering
is the generation of a set of random walks on
the input graph. The walks are generated by performing
a number $n$ of random walks, starting at each node in
the graph. A special probability transition matrix is used,
defined as follows:

$$P_{ij} = \begin{cases} \min\left( \dfrac{1}{d_i}, \dfrac{1}{d_j} \right) & \text{if } i \to j \text{ is an edge in } G \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

This choice of transition probabilities ensures that the
stationary distribution of the random walk is uniform
over all vertices. The length of the random walk is
$O(\log |V|)$, while the number of random walks per node
(denoted by $n$) is a tunable parameter of the system.
Only the start vertex and end vertex of each random walk
are used by the algorithm, and this set of vertex pairs is
called the traces, denoted by $T$.
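Trace generation can be sketched as below. The sketch rests on two assumptions that should be checked against the original paper: that the transition probability for an edge is $\min(1/d_i, 1/d_j)$, and that any leftover probability mass at a node is spent on a self-loop (as in SybilInfer), which keeps the matrix symmetric and the stationary distribution uniform. The function names, the fixed seed, and the exact walk length $\lceil \ln |V| \rceil$ are illustrative choices.

```python
import math
import random


def step(adj, i, rng):
    """Take one step from node i with P_ij = min(1/d_i, 1/d_j) (assumed form)."""
    u = rng.random()
    acc = 0.0
    d_i = len(adj[i])
    for j in sorted(adj[i]):                 # sorted for reproducibility
        acc += min(1.0 / d_i, 1.0 / len(adj[j]))
        if u < acc:
            return j
    return i                                 # leftover mass: stay put (assumption)


def generate_traces(adj, n, seed=0):
    """Return (start, end) pairs for n walks of O(log |V|) steps from every node."""
    rng = random.Random(seed)
    length = max(1, math.ceil(math.log(len(adj))))
    traces = []
    for start in adj:
        for _ in range(n):
            cur = start
            for _ in range(length):
                cur = step(adj, cur, rng)
            traces.append((start, cur))      # only the endpoints are kept
    return traces
```

Because only endpoints are stored, the trace set has exactly $n \cdot |V|$ entries regardless of walk length.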

2. A probabilistic model for P2P nodes: At the heart
of our detection algorithm lies a model that assigns a
probability to each subset of nodes of being P2P nodes.
Consider any cut $X \subset V$ of nodes in the graph. We wish
to compute the probability that the set of nodes $X$ are all
P2P nodes, given our set of traces $T$, i.e., $P(X = \text{P2P} \mid T)$.
Through the application of Bayes' theorem, we have an
expression of this probability:

$$P(X = \text{P2P} \mid T) = \frac{P(T \mid X = \text{P2P}) \cdot P(X = \text{P2P})}{Z} \qquad (6)$$

Note that we can treat $P(T)$ as a normalization constant
$Z$, as it does not change with the choice of $X$. The
prior probability $P(X = \text{P2P})$ can be used to encode any
further knowledge about P2P nodes (using honeynets), or
can simply be set uniformly over all possible cuts. Our
key theoretical task here is the computation of the probability
$P(T \mid X = \text{P2P})$, since given this probability, we can
compute $P(X = \text{P2P} \mid T)$ using Bayes' theorem.

Our intuition in proposing a model for P(T|X = P2P) 
is that for short random walks, the state probability mass 
for peer-to-peer nodes quickly approaches the station- 
ary distribution. Recall that the stationary distribution of 
our special random walks is uniform, and thus, the state 
probability mass for peer-to-peer nodes should be homo- 
geneous. We can classify the random walks in the trace T 
into two categories: random walks that end in the set X,
and random walks that end in the set $\bar{X}$ (the complementary
set of nodes).

Using our intuition that for short random walks, the 
state probability mass associated with peer-to-peer nodes 
is homogeneous, we assign a uniform probability to all 
walks ending in the set X. On the other hand, we make
no assumptions about random walks ending in the set $\bar{X}$
(in contrast to the original SybilInfer algorithm). Thus,

$$P(T \mid X = \text{P2P}) = \prod_{w \in T} P(w \mid X = \text{P2P}), \tag{7}$$


where w denotes a random walk in the trace. Now if the
walk w ends in vertex a in X, then we have that

$$P(w \mid X = \text{P2P}) = \sum_{v \in X} \frac{N_v}{n \cdot |V|} \cdot \frac{1}{|X|} \tag{8}$$


where $N_v$ denotes the number of random walks ending in
vertex v. Observe that this probability is the same for all
vertices in X. On the other hand, if the walk w ends in
vertex a in $\bar{X}$, then we have that


$$P(w \mid X = \text{P2P}) = \frac{N_a}{n \cdot |V|} \tag{9}$$

3. Metropolis-Hastings Sampling: Using the probabilistic
model for P2P nodes, we have been able to compute
the probability P(X = P2P|T) up to a multiplicative
constant Z. However, computing Z is difficult
since it involves enumeration over all subsets X of the 
graph. Thus, instead of directly calculating this probability
for any configuration of nodes X, we will sample
configurations $X_i$ following this distribution. We use
the Metropolis-Hastings algorithm [34] to compute a set
of samples $X_i \sim P(X \mid T)$. Given a set of samples S, we
can compute marginal probabilities of nodes being P2P
nodes as follows:

$$P[i \text{ is P2P}] = \frac{\sum_{j \in S} I(i \in X_j)}{|S|} \tag{10}$$

where $I(i \in X_j)$ is an indicator random variable taking
value 1 if node i is in the P2P sample $X_j$, and value 0
otherwise. Finally, we can use a threshold on the marginal
probabilities (set to 0.5) to partition the set of nodes into
fast-mixing and slow-mixing components.
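
Steps 1 and 3 above can be illustrated with a short sketch. It is not the authors' implementation: it assumes the transition rule $P_{ij} = \min(1/d_i, 1/d_j)$ with the leftover probability kept as a self-loop, it assumes a strict comparison against the 0.5 threshold, and the function names `walk_traces`, `marginals` and `fast_mixing_nodes` are hypothetical. The sampled configurations would come from the Metropolis-Hastings sampler, which is not shown.

```python
import math
import random

def walk_traces(adj, walks_per_node, rng):
    """Return (start, end) pairs of short random walks on an undirected graph.

    adj maps each vertex to a non-empty list of neighbours. A step from i to
    a neighbour j is taken with probability min(1/d_i, 1/d_j); the leftover
    probability keeps the walk at i (assumed self-loop).
    """
    length = max(1, math.ceil(math.log(len(adj))))  # O(log |V|) steps
    traces = []
    for start in adj:
        for _ in range(walks_per_node):
            cur = start
            for _ in range(length):
                nxt = rng.choice(adj[cur])          # proposed with prob 1/d_cur
                # accepting with prob min(1, d_cur/d_nxt) yields
                # an overall step probability of min(1/d_cur, 1/d_nxt)
                if rng.random() < min(1.0, len(adj[cur]) / len(adj[nxt])):
                    cur = nxt
            traces.append((start, cur))
    return traces

def marginals(samples, nodes):
    """Fraction of sampled configurations X_j that contain each node."""
    return {i: sum(i in x for x in samples) / len(samples) for i in nodes}

def fast_mixing_nodes(samples, nodes, threshold=0.5):
    """Nodes whose marginal probability of being P2P exceeds the threshold."""
    probs = marginals(samples, nodes)
    return {i for i in nodes if probs[i] > threshold}
```

Only the start and end vertex of each walk are kept, matching the definition of the traces T.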


3.4 Validation 


We note that a general graph may be composed of mul- 
tiple subgraphs having different mixing characteristics. 
However, our modified SybilInfer-based clustering
approach only partitions the graph into two subgraphs. This
means we may have to use multiple iterations of the
modified SybilInfer-based clustering algorithm to get to the
desired fastest-mixing subgraph. This raises an important
question: what is the termination condition for the
iteration? In other words, we need a validation test to
establish that we have obtained the fast-mixing P2P sub- 
graph that we were trying to detect. Next, we propose 
a set of validation tests: if all of the tests are true, the 
iteration is terminated. 


• Graph conductance test: It has been shown [62]
that the presence of a small cut in a graph results
in a slow mixing time and that a fast mixing time
implies the absence of small cuts. To formalize the
notion of a small cut, we use the measure of graph
conductance ($\Phi_X$) [43] between cuts $(X, \bar{X})$, defined
as

$$\Phi_X = \frac{\sum_{x \in X} \sum_{y \notin X} \pi(x) P_{xy}}{\pi(X)}$$

Since peer-to-peer networks are fast mixing, their
graph conductance should be high (they do not have
a small cut). Thus we can prevent further partitioning
of a fast-mixing subgraph by testing that the
graph conductance between the cuts is high.

• $q^{(t)}$ entropy comparison test: Random walks on
structured homogeneous P2P graphs are characterized
by high-entropy state probability distributions.
This means that on a graph with n nodes, a random
walk of length $t \in O(\log n)$ results in $q_i^{(t)} = 1/n$. In
this sense they are theoretically optimal. We compute
the relative entropy of the state probability distribution
in graph G(V, E) versus its theoretical optimal
equivalent graph G′. For this we use the
Kullback-Leibler (KL) divergence measure [45] to
calculate the relative entropy between $q_G$ and $q_{G'}$:

$$F_G = \sum_k q_{G'}(k) \log \frac{q_{G'}(k)}{q_G(k)}$$

When $F_G$ is close to zero,
then the mixing rates of G and G′ are comparable.
This step can be computed in O(|V|) time and
O(|V|) space.

• Degree-homogeneity test: The entropy comparison
test above does not rule out fast-mixing heterogeneous
graphs such as a star topology. However, since
structured P2P graphs have relatively homogeneous
degree distributions (by definition), we need an
additional test to measure the dispersion of degree
values. In our study, we measured the coefficient of
variation of the degree distribution of G, defined as
the ratio of standard deviation and mean: $c_G = \sigma/\mu$.
$c_G$ will be 0 for a fully homogeneous degree
distribution. This metric can also be computed within
O(|V|) time and space.
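
The three validation metrics can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the conductance helper assumes the uniform stationary distribution and the transition rule $P_{xy} = \min(1/d_x, 1/d_y)$, the KL helper assumes the uniform distribution is the reference, and all function names are hypothetical.

```python
import math

def conductance(adj, X):
    """Conductance of the cut (X, complement of X).

    Assumes pi(x) = 1/|V| for every vertex and
    P_xy = min(1/d_x, 1/d_y) for every edge (x, y).
    """
    n = len(adj)
    flow = sum(min(1 / len(adj[x]), 1 / len(adj[y])) / n
               for x in X for y in adj[x] if y not in X)
    return flow / (len(X) / n)

def kl_from_uniform(q):
    """KL divergence between the uniform distribution and q (all q_k > 0)."""
    n = len(q)
    return sum((1 / n) * math.log((1 / n) / qk) for qk in q)

def coeff_of_variation(degrees):
    """Ratio of standard deviation to mean of a degree sequence."""
    mean = sum(degrees) / len(degrees)
    var = sum((d - mean) ** 2 for d in degrees) / len(degrees)
    return math.sqrt(var) / mean
```

A termination check would compare each value against a threshold chosen by the operator; the thresholds themselves are not part of this sketch.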


4 Privacy Preserving Graph Algorithms 


In general, ISPs treat the monitoring data they collect 
from their own networks as confidential, since it can re- 
veal proprietary information about the network config- 
uration, performance, and business relationships. Thus, 
they may be reluctant to share the pieces of the commu- 
nication graph they collect with other ISPs, presenting a 
barrier to deploying our algorithms. In this section, we 
present privacy-preserving algorithms for performing the 
computations necessary for our botnet detection. Funda- 
mentally, these algorithms support the task of performing 
a random walk across a distributed graph. 


4.1 Establishing a Common Identifier Space


Our algorithms are expressed in terms of a graph G = 
(V,E), where the vertices are Internet hosts and edges 
are connections between them. This graph is assembled 
from m subgraphs belonging to m ASes, $G_i = (V_i, E_i)$,
such that $G = \bigcup_i G_i$. To simplify computations, we
would like to generate an index mapping $I : \mathbb{Z}_{|V|} \to V$.
We base our approach on private set intersection
protocols. In particular, Jarecki and Liu have shown how
to use Oblivious Pseudo-Random Functions (OPRFs)
to perform private set intersection in linear time, i.e.,
$O(|V_i| + |V_j|)$ [37]. The basic approach consists of
having a server pick a PRF $f_k(x)$ with a secret k. The
server then evaluates $S = \{f_k(s_i)\}$ for all points $s_i$ within
the server’s set and sends it to the client. The client then,
together with the server, evaluates the PRF obliviously
on all $c_i$ in its own set; i.e., the client learns $C = \{f_k(c_i)\}$
without learning k, whereas the server learns nothing
except |C|. The client can then compute $C \cap S$ and thus find
the intersection.

We extend this approach to our problem as follows: we 
pick one AS to act as the server, and the rest as clients. 
Each client uses OPRF to compute $f_k(V_i)$. The server
then generates an ordered list of $f_k(V_1)$ and sends it to the
second AS. The second AS finds $f_k(V_1) \cap f_k(V_2)$ and thus
identifies the positions of its nodes in the vector. It then
appends $f_k(V_2) \setminus f_k(V_1)$ to the list and sends the resulting
list $f_k(V_1 \cup V_2)$ to the next AS. This process continues
until the last AS is reached, which then reports |V| to all of
the others. Each AS can then compute I for any node v
in its subgraph by finding the corresponding position of
$f_k(v)$ in the list it saw.
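
The list-chaining step can be sketched as below. This is only an illustration of how the shared index grows from one AS to the next: HMAC-SHA256 stands in for the PRF and the key is handed to the helper directly, whereas in the protocol described above the clients obtain $f_k(V_i)$ obliviously and never learn k. The names `prf`, `build_index` and `index_of` are hypothetical.

```python
import hashlib
import hmac

def prf(key, host):
    """Keyed PRF f_k(host); HMAC-SHA256 is a stand-in, not an OPRF."""
    return hmac.new(key, host.encode(), hashlib.sha256).hexdigest()

def build_index(key, host_sets):
    """Chain the ASes: each one appends the PRF values not yet in the list."""
    ordered = []
    seen = set()
    for hosts in host_sets:                 # server's set first, then clients
        for tag in sorted(prf(key, h) for h in hosts):
            if tag not in seen:             # only values new to the list
                seen.add(tag)
                ordered.append(tag)
    return ordered

def index_of(key, host, ordered):
    """Position of a host's PRF value in the shared list."""
    return ordered.index(prf(key, host))
```

Because later ASes only append, the position of a value never changes once it has been added, so an AS can use the prefix of the list it saw.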

Next, the ASes need to eliminate duplicate edges. A
similar algorithm can be used here, with each ISP drop- 
ping from its observations any edge that was also ob- 
served by another ISP that comes earlier in the list. Al- 
ternatively, routing information can be used to determine 
which edges might be observed by which other AS and 
perform a pairwise set intersection including only those 
nodes. 

Finally, to perform a random walk, each AS needs to
learn the degree of each node. Since we eliminated
duplicate edges, $d(v) = \sum_{i=1}^{m} d_i(v)$, where $d_i(v)$ is the
degree of node v in $G_i$. The sum can be computed by a
standard privacy-preserving protocol, which is an extension
of Chaum’s dining cryptographers protocol [13].
Each AS i creates m random shares $s_j^{(i)} \in \mathbb{Z}_l$ such that
$\sum_{j=1}^{m} s_j^{(i)} \equiv d_i(v) \bmod l$ (where l is chosen such that
$l > \max_v d(v)$). Each share $s_j^{(i)}$ is sent to AS j. After all
shares have been distributed, each AS computes
$s_j = \sum_{i=1}^{m} s_j^{(i)} \bmod l$ and broadcasts it to all the other ASes.
Then $d(v) = \sum_{j=1}^{m} s_j \bmod l$. This protocol is
information-theoretically secure: any set of malicious ASes S only
learns the value $d(v) - \sum_{j \in S} d_j(v)$. The protocol can be
executed in parallel for all nodes v to learn all node
degrees.
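
A small simulation of the share exchange for a single node v is sketched below. It runs all ASes in one process purely to show the arithmetic; the function names and the choice of modulus are hypothetical, and a real deployment would draw shares from a cryptographically secure source rather than `random.Random`.

```python
import random

def make_shares(value, m, modulus, rng):
    """Split value into m additive shares that sum to value mod modulus."""
    shares = [rng.randrange(modulus) for _ in range(m - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

def secure_degree_sum(local_degrees, modulus, rng):
    """Simulate the exchange among m ASes; modulus must exceed the true sum."""
    m = len(local_degrees)
    # shares[i][j] is the share that AS i sends to AS j
    shares = [make_shares(d, m, modulus, rng) for d in local_degrees]
    # AS j adds up what it received and broadcasts the partial sum s_j
    partial = [sum(shares[i][j] for i in range(m)) % modulus for j in range(m)]
    return sum(partial) % modulus
```

No single partial sum reveals a local degree, yet the broadcast values add up to the total degree.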


4.2 Random Walk 


We perform a random walk by using matrix operations. 
In particular, given a transition matrix T and an initial
state vector $\vec{v}$, we can compute $T\vec{v}$, the state vector after a
single random walk step. Our basic approach is to create
matrices $T_i$ such that $\sum_{i=1}^{m} T_i = T$. We can then compute
$T_i\vec{v}$ in a distributed fashion and compute the final sum at
the end.

To construct $T_i$, an AS will set the value $(T_i)_{j,k}$ to be
$1/d(v_j)$ for each edge $(j,k) \in E_i$ (after duplicate edges
have been removed). Note that this transition matrix is
sparse; it can be represented by N linked lists of
nonzero values $(T_i)_{j,k}$. Thus, the storage cost is
$O(|E_i|) \ll O(|V_i|^2)$.

To protect privacy, we use Paillier encryption [55] to
perform computation on an encrypted vector $E(\vec{v})$.
Paillier encryption supports a homomorphism that allows
one to compute $E(x) \oplus E(y) = E(x + y)$; it also allows
multiplication by a constant: $c \otimes E(x) = E(cx)$. Thus,
given an encrypted vector $E(\vec{v})$ and a known matrix $T_i$, it
is possible to compute $E(T_i\vec{v})$.

Damgård and Jurik [20] showed an efficient distributed
key generation mechanism for Paillier that allows
the creation of a public key K such that no individual
AS knows the private key, but together, they can
decrypt the value. In the full protocol, one AS creates an
encrypted vector $E(\vec{v})$ that represents the initial state of
the random walk. This vector is sent to each AS, which
then computes $E(T_i\vec{v})$. The ASes sum up the individual
results to obtain $E(\sum_i T_i\vec{v}) = E(T\vec{v})$. This process can
be iterated to obtain $E(T^k\vec{v})$. Finally, the ASes jointly
decrypt the result to obtain $T^k\vec{v}$.
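
The homomorphic matrix-vector product can be illustrated with a toy Paillier implementation. This is a sketch under several assumptions: the primes are far too small for real use, a single party holds the private key (the distributed key generation and joint decryption described above are not modelled), the matrix weights are plain integers standing in for fixed-point values, and every function name is hypothetical. It needs Python 3.9 or later for `math.lcm`.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key pair from two primes (generator g = n + 1)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                 # modular inverse of lambda mod n
    return n, (lam, mu)

def encrypt(n, m, rng):
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:           # r must be invertible mod n
        r = rng.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n

def add(n, c1, c2):
    """E(x) (+) E(y) = E(x + y): multiply the ciphertexts."""
    return (c1 * c2) % (n * n)

def scale(n, c, k):
    """k (x) E(x) = E(k * x): raise the ciphertext to the power k."""
    return pow(c, k, n * n)

def encrypted_matvec(n, rows, enc_v, rng):
    """Compute E(T_i v) from a sparse integer matrix and an encrypted vector.

    rows[j] lists the (column, weight) pairs of the nonzero entries of row j.
    """
    out = []
    for row in rows:
        acc = encrypt(n, 0, rng)
        for k, w in row:
            acc = add(n, acc, scale(n, enc_v[k], w))
        out.append(acc)
    return out
```

Each AS would call `encrypted_matvec` with its own sparse matrix; combining the per-AS results with `add` gives an encryption of the full product without any AS seeing the vector.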

Note that Paillier operates over members of $\mathbb{Z}_n$, where n
is the product of two large primes. However, the vector
$\vec{v}$ and the transition matrices $T_i$ contain fractional values.
To address this, we used fixed-point representation,
storing $\lfloor x \times 2^c \rfloor$ (equivalently, $(x - \epsilon) \times 2^c$, where $\epsilon < 2^{-c}$).
Each multiplication results in changing the position of
the fixed point, since:

$$((x - \epsilon_1) \times 2^c)((y - \epsilon_2) \times 2^c) = (xy - \epsilon_3) \times 2^{2c}$$

where $\epsilon_3 < 2^{-c+1}$. Therefore, we must ensure that $2^{kc} < n$,
where k is the number of random walk steps. The
maximal length random walk we use is $2\log_d |V|$, where
d is the average node degree, so k < 40, which gives us
plenty of fixed-point precision to work with for a typical
choice of n (1024 or 2048 bits).²
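
The fixed-point bookkeeping can be checked with a few lines of arithmetic. The sketch below is illustrative only: the value c = 20 and the helper names are hypothetical, and `max_fraction_bits` gives a conservative bound derived from the condition that 2 raised to (steps × c) must stay below the modulus.

```python
import math

def to_fixed(x, c):
    """Fixed-point encoding: floor(x * 2^c)."""
    return math.floor(x * 2 ** c)

def max_fraction_bits(modulus_bits, steps):
    """Largest c with steps * c <= modulus_bits - 1 (conservative bound)."""
    return (modulus_bits - 1) // steps

c = 20
x, y = 0.05, 0.25
product = to_fixed(x, c) * to_fixed(y, c)   # now scaled by 2^(2c), not 2^c
approx = product / 2 ** (2 * c)             # rescale to compare with x * y
```

Under this bound, 40 steps leave room for about 25 fractional bits per value with a 1024-bit modulus and about 51 with a 2048-bit modulus.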





4.3 Performance 


Although the base privacy-preserving protocols we pro- 
pose are efficient, due to the large data sizes, the oper- 
ations still take a significant amount of processing time. 


²Note that the multiplication of probabilities might result in values
that are extremely small; however, the number of digits after the fixed
point correspondingly increases after each multiplication, preventing
loss of precision.

³The CPU time is estimated based on experiments on different
hardware; however, these numbers are intended to provide an
order-of-magnitude estimate of the costs.

Table 1: Privacy Preserving Operations

Step                                CPU time AS1 (s)³
1. Determine common identifiers     1 020 000
2. Eliminate duplicate edges        8 160 000
3. Compute node degrees             (no crypto)
4. Random walk (20 steps)           8 000 000


We estimate the actual processing costs and bandwidth 
overhead, using some approximate parameters. In par- 
ticular, we consider a topology of 30 million hosts, with 
an average degree of 20 per node.⁴

The running time of the intersections to compute a
common representation is linear in $|V_i| + |V_j|$. We expect
that $|V_i| < |V|$, but in the worst case, each ISP sees all of
the nodes. Projecting linearly, we expect to spend about
30 000 s on an intersection between two ISPs. Most ASes
must perform only one intersection, but the first AS is
involved in m − 1 intersections. We expect m to be around
35, based on our analysis of visibility of bot paths by
tier-1 ISPs (Section 5.1). An important feature of the
algorithm is that each ISP other than the first need only
perform as many OPRF evaluations as it has nodes in its
observation table; thus smaller ISPs with fewer resources
need to perform correspondingly less work. We therefore
suggest that the largest contributing ISP be chosen as the
server. De Cristofaro and Tsudik suggest an efficiency
improvement for Jarecki and Liu’s algorithm [18]; they
find that the server computation for 1 000 client values is
less than 400 ms. Projecting linearly, we expect that the
server load per client should be 12 000 seconds.

The next series of set intersections involve edge sets. 
The worst-case scenario for this computation assumes 
that all ASes see all edges, although, of course, this is 
unlikely (and would mean that the participation of some 
ASes is redundant). The load on the central server is
(0.4 s/1000) · 600 000 000 · 34 = 8 160 000 s.

A step of the random walk requires O(|E|) homomor- 
phic multiplications and additions of encrypted values. 
Our measurements with libpaillier⁵ show that the
multiplications are two orders of magnitude slower than 
additions. We were able to perform approx. 1500 mul- 
tiplications per second using a 2048-bit modulus. This 
means that a single step would take 400 000s of compu- 
tation. 
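
The estimates quoted in this section follow from a few multiplications, reproduced here as a worked check. The variable names are hypothetical; the input figures are the ones stated in the text (30 million hosts, average degree 20, m = 35, 400 ms per 1 000 OPRF values, about 1500 Paillier multiplications per second, and 30 000 s per pairwise intersection).

```python
hosts = 30_000_000              # topology size assumed in the text
avg_degree = 20
ases = 35                       # expected number of participating ASes (m)
oprf_s_per_value = 0.4 / 1000   # server time per client value, in seconds
mults_per_s = 1500              # Paillier multiplications per second

edges = hosts * avg_degree                              # 600 000 000
server_load_per_client = oprf_s_per_value * hosts       # about 12 000 s
identifier_total = 30_000 * (ases - 1)                  # 1 020 000 s
dedup_load = oprf_s_per_value * edges * (ases - 1)      # about 8 160 000 s
walk_step = edges / mults_per_s                         # 400 000 s
walk_total = 20 * walk_step                             # 8 000 000 s
```

The figure for a single random walk step counts only the homomorphic multiplications, since the text reports additions as two orders of magnitude cheaper.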

We summarize the costs of the computation in Ta- 
ble 1. It is important to note that all of the operations 
are trivially parallelizable and thus can be computed on a 
moderately-sized cluster of commodity machines.
Additionally, the table represents the costs of an initial
computation; updated results can be computed by operating
only on the deltas of the observations, which we expect
to be significantly smaller.

⁴The choice of topology size and the average node degree is
motivated by our experimental setting in Section 5.
⁵http://acsc.cs.utexas.edu/libpaillier/


5 Results 


To evaluate performance of our design, we evaluate it in 
the context of real Internet traffic traces. Ideally, to
evaluate our design, we would like to have a list of all bots
in the Internet, along with logs of packets flowing
between them, in addition to packet traces between
non-botnet hosts. Unfortunately, acquiring data this
extensive is very hard, due to the (understandable) reluctance
of ISPs to share their internal traffic, and the difficulty in
gaining ground truth on which hosts are part of a botnet.


To address this, we apply our approach to synthetic 
traces. In particular, we construct a topology containing 
a botnet communication graph, and embed it within a 
communication graph corresponding to background traf- 
fic. To improve realism, we build the background traf- 
fic communication graph by using real traffic collected 
from Netflow logs from the IP backbone of the Abi- 
lene Internet2 ISP. For our analysis, we consider a full 
day’s trace collected on 22 October 2009. Since Abi- 
lene’s NetFlow traces are aggregated into /24-sized sub- 
nets for anonymity, we perform the same aggregation for 
the botnet graph, and collect experimental results over 
the resulting subnet-level communication graph (we ex- 
pect if our design were deployed in practice with access 
to per-host information, its performance would improve 
due to increased visibility). To investigate sensitivity of 
our results to this methodology and data set, we also use 
packet-level traces collected by CAIDA on OC192 Inter- 
net backbone links [5] on 11 January 2009. To construct 
the botnet graph, we select a random subset of nodes in 
the background communication graph to be botnet nodes, 
and synthetically add links between them correspond- 
ing to a particular structured overlay topology. We then 
pass the combined graph as input to our algorithm. By 
keeping track of which nodes are bots (this information 
is not passed to our algorithm), we can acquire “ground 
truth” to measure performance. To investigate sensitivity 
of our techniques to the particular overlay structure, we 
consider several alternative structured overlays, includ- 
ing (a) Chord, (b) de Bruijn, (c) Kademlia, and (d) the 
“robust ring” topology described in [39]. The remainder 
of this section contains results from running our algo- 
rithms over the joined botnet and Internet communica- 
tion graphs, and measuring the ability to separate out the 
two from each other. 


Before we proceed to the results, we first illustrate our 
inference algorithm with an example run.

Figure 2: The filtered limit distribution ($s_i$) after clustering


5.1 Algorithm Example 


Let us consider a specific application of our algorithm 
on a synthetically-generated de Bruijn [41] peer-to-peer 
graph embedded within a communication graph sampled 
from the Internet (using NetFlow traces from the
Abilene Internet2 ISP). The Abilene communication graph
$G_D$ contains $|V_D| = 104426$ nodes. We then generated a
de Bruijn graph $G_p$ of 10000 nodes, with m = 10
outgoing links and n = 4 dimensions (10% of |V|). $G_p$ is
then embedded in $G_D$ by mapping a node in $G_p$ into a
node in $G_D$: for every node $i \in V_p$ we select a node $j \in V_D$
uniformly at random between 1 and $|V_D|$ without
replacement, and add the corresponding edges in $E_p$ to $E_D$. The
resulting graph is G(V,E) with N = |V| = 104426 nodes
and |E| = 647053 edges. The goal of our detection
technique is to extract $G_p$ from $G_D$ as accurately as possible.
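
The embedding step can be sketched as follows. This is an illustrative helper, not the authors' tooling: the function name `embed` and its arguments are hypothetical, edges are treated as undirected, and the overlay edge list is assumed to have been generated already.

```python
import random

def embed(background_edges, background_nodes, overlay_edges, overlay_nodes, rng):
    """Map overlay nodes onto distinct background nodes and merge the edges.

    Returns the merged undirected edge set and the set of background nodes
    that now act as bots (the ground truth, kept away from the detector).
    """
    # sampling without replacement gives each overlay node its own host
    targets = rng.sample(sorted(background_nodes), len(overlay_nodes))
    mapping = dict(zip(sorted(overlay_nodes), targets))
    merged = {frozenset(e) for e in background_edges}
    merged |= {frozenset((mapping[u], mapping[v])) for u, v in overlay_edges}
    return merged, set(targets)
```

Because bots are mapped onto existing hosts, the node count of the combined graph equals that of the background graph, and only the edge set grows.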

First, we apply the pre-filtering step: we carry out a 
short random walk starting from every node with
probability 1/N to obtain $q^{(t)}$, on which the transformation
filter of Equation 3 is applied to obtain s. We used a
dampening constant of r = 100 to undermine the influence of
hub nodes on the random walk process. The data points
in s corresponding to each of the partitions returned by
k-means clustering are shown in Figure 2.

In the example we consider here, applying the k- 
means algorithm gives us ten sets of potential P2P can- 
didates. In a completely unsupervised setting, we would 
need to run the modified SybilInfer algorithm on each of 
the candidate sets. However, we expect that the analysis
can simply be focused on the candidate set containing the
set of honeynet nodes. Thus, let us consider the graph
nodes corresponding to the fourth cluster (colored in
yellow). The cluster size is 17576 nodes.

Table 2: Termination Conditions

Condition             Final iter.   Other iters.
Conductance           0.9           < 0.5
KL-divergence         0.1           > 0.45
Entropy               0.97          < 0.64
Coeff. of variation   < 1           > 4.6

Next, we recursively apply the modified SybilInfer 
partitioning algorithm to this cluster. After three
iterations of the SybilInfer partitioning algorithm, we obtain
a subgraph of size 10143 nodes, containing 9905 P2P 
nodes, and 238 other nodes. At this stage, our set of val- 
idation conditions indicates that the sub-graph is indeed 
fast mixing, and we stop the recursion. Table 2 shows the 
values of the validation metrics on the final subgraph and 
the previous graphs. There is a significant gap, making it 
easy to select a threshold value. 
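
For this single example run, the error rates can be worked out from the reported counts, taking the false positive rate as the fraction of non-bot nodes flagged as bots and the false negative rate as the fraction of bot nodes missed. The helper name `error_rates` is hypothetical, and the figures below describe this one run only, not the averaged results reported in the tables.

```python
def error_rates(detected, bots, all_nodes):
    """Return (false positive rate, false negative rate) as fractions."""
    non_bots = all_nodes - bots
    fp = len(detected - bots) / len(non_bots)
    fn = len(bots - detected) / len(bots)
    return fp, fn

# Counts from the example: 104426 nodes in total, 10000 of them bots;
# the final subgraph holds 9905 bots and 238 other nodes.
fp_rate = 238 / (104426 - 10000)        # non-bot nodes flagged as bots
fn_rate = (10000 - 9905) / 10000        # bot nodes that were missed
```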

To evaluate performance, we are concerned with the 
false positive rate (the fraction of non-bot nodes that are 
detected as bots) and the false negative rate (the frac- 
tion of bot nodes that are not detected). These results 
are shown in Tables 3(a) and 3(b). The experimental 
methodology and parameters used were the same as in 
the above example. All results are averaged over five 
random seeds. Overall, we found that BotGrep was able 
to detect 93-99% of bots over a variety of topologies and 
workloads. In particular, we observed several key results: 


Effect of botnet topology: To study applicability of 
our approach to different botnet topologies, we consider 
Kademlia [50], Chord [70], and de Bruijn graphs. In ad- 
dition, we also consider the LEET-Chord topology [39], 
a recently proposed overlay topology that aims to be dif- 
ficult to detect (cannot be reliably detected with exist- 
ing traffic dispersion graph techniques). Overall, we find 
performance to be fairly stable across multiple kinds of 
botnet topologies, with detection rates higher than 95%. 
In addition, BotGrep is able to achieve a false positive
rate of less than 0.42% on the harder-to-detect
LEET-Chord topology. While our approach is not perfectly
accurate, we envision it may be of use when coupled with
other detection strategies (e.g., previous work on botnet
detection [38, 36]), or if used to signal “hints” to
network operators regarding which hosts may be infected.
Furthermore, while the LEET-Chord topology is harder
to detect, this comes at the cost of less resilience
to failure. Figure 3 compares the robustness of Chord and
LEET-Chord when varying percentages of nodes are
removed at random. We observed that LEET-Chord is much less
resilient to node failures (or active attacks) as compared
with Chord. This trade-off between stealthiness of the
topology and its resilience is not surprising, since a
common indicator of resilience is the bisection bandwidth,
and Sinclair [66] has shown that the bisection bandwidth
is bounded by the mixing time of the topology. Thus, it
is likely that the use of stealthy slow-mixing topologies
to escape detection via BotGrep would adversely affect
the resilience of the botnet.

Figure 3: Robustness of Chord and LEET-Chord with 65,536
nodes. We also consider an alternative, LEET-Chord-Iter, where
routing proceeds as in regular LEET-Chord, but when the
destination is outside the node’s cluster and all long-range
links have failed, it greedily forwards the packet iteratively to
the next clockwise cluster.


Effect of botnet graph size: Next, we vary the size 
of the embedded botnet. We do this to investigate perfor- 
mance as a function of botnet size, for example, to evalu- 
ate whether BotGrep can efficiently detect small botnets 
(e.g., bots in early stages of deployment, which may have 
greater chance of containment) and large-scale botnets 
(which may pose significant threats due to their size and 
large topological coverage). We perform this experiment 
by keeping the size of the background traffic graph
constant, and generating synthetic botnet topologies of
varying sizes (between 100 and 100,000 bots). The degree
of bot nodes in the case of Chord and Kademlia depends
logarithmically on the size of the topology, while for de Bruijn,
we used a constant node degree of 10. Overall, we found 
that as the size of the bot graph increases, performance 
degrades, but only by a small amount. For example, in 
Table 3(a), with the fully visible de Bruijn topology, for 
100 nodes the false positive rate is zero, while for 10,000 
nodes the rate becomes 0.12%. 


Effect of background graph size: One concern is that 
BotGrep may perform less accurately with larger back- 
ground graphs, as it may become easier for the botnet 
structure to “hide” in the increasing number of links in 
the graph. To evaluate sensitivity of performance to 
scale, we vary the size of the background communication 
graph, by evaluating over both the Abilene and CAIDA 
dataset (104,426 and 3,839,936 nodes, respectively).

(a) Abilene

Topology      |V_p|    % FP   % FN   % Detected
de Bruijn       100    0.00   2.00   98.00
               1000    0.01   2.40   97.60
              10000    0.12   2.35   97.65
Kademlia        100    0.00   3.20   97.80
               1000    0.01   2.48   98.52
              10000    0.10   2.12   97.88
Chord           100    0.00   3.00   97.00
               1000    0.01   2.32   97.68
              10000    0.08   1.94   98.06
LEET-Chord      100    0.00   3.00   97.00
               1000    0.03   1.60   98.40
              10000    0.42   1.00   99.00

(b) CAIDA

Topology      |V_p|    % FP   % FN   % Detected
de Bruijn      1000    0.00   1.80   98.20
              10000    0.01   0.93   99.07
             100000    0.09   0.67   99.33
Kademlia       1000    0.00   2.10   97.90
              10000    0.01   0.80   99.20
             100000    0.19   0.17   99.83
Chord          1000    0.00   2.20   97.80
              10000    0.01   0.48   99.52
             100000    0.06   0.46   99.54
LEET-Chord     1000    0.00   0.40   99.60
              10000    0.02   0.48   99.52

Table 3: Detection and error rates of inference for (a) Abilene and (b) CAIDA communication graphs

(a) CAIDA 30M

Topology      |V_p|    % FP   % FN   % Detected
de Bruijn    100000    0.01   0.8    99.20
Kademlia     100000    0.01   0.4    99.60
Chord        100000    0.01   0.4    99.60

(b) Leveraging Honeynets - CAIDA

Topology      |V_p|    % FP   % FN   % Detected
de Bruijn    100000    0.04   0.8    99.20
Kademlia     100000    0.05   0.4    99.60
Chord        100000    0.04   0.4    99.60

Table 4: Detection and error rates of inference (a) for CAIDA 30M (b) when leveraging Honeynets for CAIDA.

To get a rough sense of performance on much larger
background graphs, we also build a “scaled up” version of
the CAIDA graph containing 30 million hosts while re- 
taining the statistical properties of the CAIDA graph. To 
scale up the CAIDA graph $G_c$ by a factor of k, we make
k copies of $G_c$, namely $G_1 \ldots G_k$ with vertex sets $V_1 \ldots V_k$
and edge sets $E_1 \ldots E_k$. Note that for each edge (p,q) in
$E_c$, we have a corresponding edge in each copy $G_1 \ldots G_k$;
we refer to these as $(p_1,q_1) \ldots (p_k,q_k)$. We then compute
the graph disjoint union over them as $G_S(V_S, E_S)$ where
$V_S = (V_1 \cup V_2 \cdots \cup V_k)$ and $E_S = (E_1 \cup E_2 \cdots \cup E_k)$. Next,
we randomly select a fraction of links from $E_S$ to
obtain a set of edges $E_r$ that we shall rewire. As a
heuristic, we set the number of links selected for rewiring to
$|E_r| = k\sqrt{N \log(N)}$ where N is the number of nodes in
the CAIDA graph $G_c$. For each edge (p,q) in $E_r$ we
wish to rewire, we choose two random numbers a and
b ($1 \le a, b \le k$) and rewire edges $(p_a,q_a)$ and $(p_b,q_b)$ to
$(p_a,q_b)$ and $(p_b,q_a)$ such that $d_{p_a} = d_{p_b}$ and $d_{q_a} = d_{q_b}$.
This edge rewiring ensures that (a) the degree of all
four nodes $p_a$, $q_a$, $p_b$ and $q_b$ remains unchanged, (b) the
joint degree distribution $P(d_1,d_2)$, the probability that
an edge connects $d_1$ and $d_2$ degree nodes, remains
unchanged, and (c) $P(d_1,d_2,\ldots,d_l)$ remains unchanged as
well, where l is the number of unique degree values that
nodes in $G_c$ can take.
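
The copy-and-rewire procedure can be sketched as below. It is a simplified illustration, not the authors' generator: nodes are assumed to be hashable and edges to be listed once each, rewiring candidates are drawn by picking a base edge and two copies at random, and a pair is skipped if either edge has already been rewired, so the number of swaps performed may be smaller than the budget. The names `scale_up` and `degrees` are hypothetical.

```python
import math
import random

def scale_up(edges, num_nodes, k, rng):
    """Make k disjoint copies of a graph, then swap endpoints between copies.

    Copy c holds node p as the pair (c, p). Swapping the q-endpoints of the
    same base edge in two copies leaves every node's degree unchanged.
    """
    scaled = {((c, p), (c, q)) for c in range(k) for p, q in edges}
    base = sorted(edges)
    budget = int(k * math.sqrt(num_nodes * math.log(num_nodes)))
    for _ in range(budget):
        p, q = rng.choice(base)
        a, b = rng.randrange(k), rng.randrange(k)
        old_a, old_b = ((a, p), (a, q)), ((b, p), (b, q))
        if a == b or old_a not in scaled or old_b not in scaled:
            continue                     # nothing to swap for this draw
        scaled -= {old_a, old_b}
        scaled |= {((a, p), (b, q)), ((b, p), (a, q))}
    return scaled

def degrees(edge_set):
    """Degree of every node that appears in an edge set."""
    deg = {}
    for u, v in edge_set:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg
```

Since each swap exchanges endpoints between two copies of the same base edge, the edge count and every node's degree are preserved while the copies become connected to one another.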


Overall, we found that BotGrep scales well with net- 
work size, with performance remaining stable as network 
size increases. For example, in the CAIDA dataset with 
a background graph of size 3.8 million hosts, the false 
positive rate for the de Bruijn topology of size 100000 
is 0.09% (shown in Table 3b), while for the scaled up 
30 million node CAIDA topology, this rate is 0.01%
(Table 4(a)). Observe that the false positive rate has
decreased by a factor of 9, which is approximately equal
to the scale-up factor between the two topologies. The
actual number of false positives therefore remains roughly
the same, which indicates that it depends on botnet size
and not on the background graph size.

Figure 4: Number of visible botnet links, as a function of the
number of most-affected ASes contributing views.


Effect of reduced visibility: In the experiments we 
have performed so far, the embedded structured graph $G_p$
is present in its entirety. However, just as $G_D$ is obtained
by sampling Internet or enterprise traffic, only a subset of 
botnet control traffic will actually be available to us. It is 
therefore important to evaluate how well our algorithms 
work with graphs where only a fraction of the structured 
subgraph edges are known. To study this, we evalu- 
ate performance of our scheme when deployed at only 
a subset of ISPs in the Internet.

(a) Abilene

Topology      |V_p|    % FP   % FN   % Detected
de Bruijn       100    0.00   3.00   97.00
               1000    0.02   2.80   97.20
              10000    OAT    3.31   96.69
Kademlia        100    0.00   3.75   96.25
               1000    0.01   2.90   97.10
              10000    0.19   2.07   97.93
Chord           100    0.00   9.00   91.00
               1000    0.02   3.50   96.50
              10000    0.13   2.54   97.46
LEET-Chord      100    0.00   6.00   94.00
               1000    0.06   2.70   97.30
              10000    0.58   1.80   98.20

(b) CAIDA

Topology      |V_p|    % FP   % FN   % Detected
de Bruijn      1000    0.00   2.70   97.30
              10000    0.00   4.22   95.78
             100000    0.12   1.74   98.26
Kademlia       1000    0.00   0.50   99.50
              10000    0.01   0.30   99.70
             100000    0.09   0.95   99.47
Chord          1000    0.00   3.40   96.60
              10000    0.01   0.65   99.35
             100000    0.06   5.36   94.64
LEET-Chord     1000    0.01   0.20   99.80
              10000    0.02   1.09   98.91

Table 5: Results if only Tier-1 ISPs contribute views, for (a) Abilene and (b) CAIDA

To do this, we collected
roughly 4,000 Storm botnet IP addresses from Botlab [1]
(botlab-storm), and measured what fraction of inter-bot 
paths were visible from tier-1 ISPs. From an analysis of 
the Internet AS-level topology [63], we find that 60% 
of inter-bot paths traverse tier-1 ISPs. We found that
if the most-affected ASes (those with the largest number
of bots) cooperate, this number increases to 89%.
Figure 4 shows this result in more detail. Here, we vary 
the number of ASes cooperating to contribute views (as- 
suming the most-affected ASes contribute views first), 
plotting the number of visible inter-bot links. We repeat 
the experiment also for the Kraken botnet trace from [1] 
(kraken-botlab), as well as a packet-level trace from the 
Storm botnet (storm-trace). We find that if only the 5 
most-affected ASes contribute views, 57% of Storm links 
and 65% of Kraken links were visible. 


We therefore removed 40% of links from our botnet 
graphs (Table 5a and Table 5b). While the false-negative 
rate increases, our approach still detects over 90% of bot- 
net hosts with high reliability (the false positive rate for 
the hard to detect LEET-Chord topology still remains 
less than 0.58%). Disabling or removing such a large 
fraction of nodes will lead to certain loss of operational 
capability. 


Leveraging Honeynets: We shall now present an exten- 
sion to our inference algorithm that leverages the knowl- 
edge of a few known bot nodes. This extension considers 
random walks starting only from the honeynet nodes to 
obtain a set of candidate P2P nodes in the prefiltering 
stage. Using this extension, we find that there is a
significant gain in terms of reducing the false positives, as
well as improving the efficiency of the protocol. As
Table 4b shows, the false positive rate for the Kademlia
topology has been reduced by a factor of 4 as compared
to the corresponding value in Table 3b. Furthermore, only a
single iteration of the modified SybilInfer algorithm was
required to obtain the final subgraphs, providing a
significant gain in efficiency.


Effect of inference algorithm: For comparison
purposes, we also consider several graph partitioning
algorithms that have been proposed in the literature. While
these techniques were not intended to scale up to the
large data sets we consider here, we can compare against
them on smaller data sets to get a sense of how BotGrep
compares against these approaches. In particular, several
algorithms for community detection (detecting groups of
nodes in a network with dense internal connections) have
been proposed. Work in this space mainly focuses on
hierarchical clustering methods and falls into the following
two categories; for our evaluation, we implement two
representative algorithms from each category:

Edge importance based community structure detec- 
tion iteratively removes the edges with the highest im- 
portance, which can be defined in different ways. Gir- 
van and Newman [25] defined edge importance by its 
shortest path betweenness. The idea 1s that the edge with 
higher betweenness is typically responsible for connect- 
ing nodes from different communities. In [22], informa- 
tion centrality has been proposed to measure the edge 
importance. The information centrality of an edge is de- 
fined as the relative network efficiency [46] drop caused 
by the removal of that edge. The time complexities of the
algorithms in [25] and [22] are $O(|V|^3)$ and
$O(|E|^3 \times |V|)$, respectively.
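
To make the edge-importance idea concrete, the following is a minimal Python sketch of one removal step in the style of Girvan and Newman: it computes shortest-path edge betweenness with a Brandes-style breadth-first accumulation and removes the highest-scoring edge. It is not the implementation evaluated here. The adjacency-dictionary representation and the names edge_betweenness and remove_most_between_edge are assumptions made for illustration.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph.

    adj is a dict {node: set(neighbours)} (an illustrative representation).
    """
    score = {}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording shortest-path predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + c
                delta[v] += c
    # Every undirected path was counted once from each endpoint.
    return {edge: total / 2.0 for edge, total in score.items()}


def remove_most_between_edge(adj):
    """Remove the edge with the highest betweenness (one iteration)."""
    scores = edge_betweenness(adj)
    u, v = max(scores, key=scores.get)
    adj[u].discard(v)
    adj[v].discard(u)
    return frozenset((u, v))
```

On a graph made of two triangles joined by a single bridge, the bridge carries every cross-community shortest path and is removed first. Recomputing all scores after each removal is what makes the approach expensive on large graphs.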

The spectral-based approach detects communities by
optimizing the modularity (a benefit function that measures
community structure [52]) over possible network divi-
sions. In [53], the communities are detected by calcu-
lating the eigenvector of the modularity matrix. It takes
$O(|E| + |V|^2)$ time to separate each community. More-
over, Clauset et al. [14] proposed a hierarchical agglom-
eration algorithm for community detection. The pro-
posed greedy algorithm adopts more sophisticated data
structures to reduce the computation time of modularity 
calculation. The time complexity is $O(|E| + |V| \log^2 |V|)$
on average.
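
As an illustration of the quantity these methods optimize, the sketch below computes the standard modularity of a given division of an undirected graph: the fraction of edges inside communities minus the fraction expected if edges were placed at random with the same degrees. The function name and the adjacency-dictionary input are assumptions for illustration, and the sketch only scores a division rather than searching for the best one.

```python
def modularity(adj, community_of):
    """Modularity Q of a division of an undirected graph.

    adj: dict {node: set(neighbours)}; community_of: dict {node: label}.
    Both representations are illustrative assumptions.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2.0  # number of edges
    internal = {}  # per community: internal edge endpoints (2 per edge)
    degree = {}    # per community: sum of node degrees
    for u, nbrs in adj.items():
        c = community_of[u]
        degree[c] = degree.get(c, 0) + len(nbrs)
        for v in nbrs:
            if community_of[v] == c:
                internal[c] = internal.get(c, 0) + 1
    return sum(
        internal.get(c, 0) / (2.0 * m) - (degree[c] / (2.0 * m)) ** 2
        for c in degree
    )
```

For two triangles joined by one edge, placing each triangle in its own community gives Q = 5/14, while placing all nodes in one community gives Q = 0.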

As the time complexity of the above algorithms is not ac-
ceptable for large-scale networks, here we




Topology    BotGrep      Fast Greedy    Girvan-Newman   Modularity
                         Modularity     Betweenness     Eigenvector

de Bruijn   0.78/2.55    14.43/7.65     19.73/15.31     0.92/43.88
Chord       0.77/7.15    7.58/10.13     6.05/19.50      4.24/20.19
Kademlia    0.92/7.00    14.66/33.80    18.06/4.75      5.70/48.70


Table 6: 2k Abilene Results (% FP / % FN)


consider a small-scale scenario for performance evalua- 
tion. We extract subgraphs from full Abilene data by per- 
forming a Breadth-First-Search (BFS) starting at a ran- 
domly selected node, in which the overall visited nodes 
are limited by a size of 2000. Results from our com- 
parison are shown in Table 6. The information central- 
ity algorithm took more than one month to run for just 
one iteration on this 2000-node graph, and was hence 
excluded from further analysis (we tested information 
centrality on smaller 50-node graphs, and found perfor- 
mance comparable to the Girvan and Newman Between- 
ness algorithm). Overall, we found that our approach 
outperformed these approaches. For example, on the 
Chord topology, BotGrep’s false positive rate was 0.77%, 
while false positive rates for the other approaches ranged 
from 4.24% to 7.58%. BotGrep performs worse on
this scaled-down 2000-node topology than on the
earlier Abilene and CAIDA datasets, because our method
of generating the scaled-down 2000-node graph selected
the densely connected core of the graph, which is fast- 
mixing, while on more realistic graphs, it is easier for 
BotGrep to distinguish the fast-mixing botnet topology 
from the rest of the non-fast-mixing background graph. 
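
The subgraph extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the graph is an adjacency dictionary, the name bfs_sample is hypothetical, and the optional seed exists only to make the random choice of start node repeatable.

```python
import random
from collections import deque


def bfs_sample(adj, max_nodes, seed=None):
    """Induced subgraph on the first max_nodes nodes reached by BFS.

    adj: dict {node: set(neighbours)} (illustrative representation).
    """
    rng = random.Random(seed)
    start = rng.choice(sorted(adj))  # randomly selected start node
    visited = {start}
    queue = deque([start])
    while queue and len(visited) < max_nodes:
        v = queue.popleft()
        for w in adj[v]:
            if w not in visited:
                visited.add(w)
                queue.append(w)
                if len(visited) == max_nodes:
                    break
    # Keep only edges whose endpoints were both visited.
    return {v: adj[v] & visited for v in visited}
```

Because breadth-first search expands outward from a single node, a sample produced this way is a connected neighbourhood of the start node rather than a uniform sample of the graph.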

Moreover, we found that run-time was a significant 
limiting factor in using these alternate approaches. For 
example, the Girvan-Newman Betweenness Algorithm 
took 2.5 hours to run on a graph containing 2000 nodes 
(in all cases, BotGrep runs in under 10.4 seconds on a 
Core2 Duo 2.83GHz machine with 4GB RAM using a 
single core). While these traditional techniques were not 
intended to scale to the large data sets we consider here, 
they may be appropriate for localizing smaller botnets in 
contained environments (e.g., within a single Honeynet, 
or the part of a botnet contained within an enterprise net- 
work). Since these techniques leverage different features 
of the inputs, they are synergistic with our approach, and 
may be used in conjunction with our technique to im- 
prove performance. 


6 Discussion 


As we have demonstrated, analysis of core Internet traf- 
fic can be effective at identifying nodes and communi- 
cation links of structured overlay networks. However, 
many challenges remain to turn our approach into a full-
scale detection mechanism.


Misuse Detection: It is easy to see that other forms of 
P2P activity, such as file sharing networks, will also be 
identified by our techniques. While there is some benefit 
to being able to identify such traffic as well, it requires a 
dramatically different response than botnets and so it is 
important to distinguish the two. We believe that funda- 
mentally, our mechanisms need to be integrated with de- 
tection mechanisms at the edge that identify suspicious 
behavior. Also, multiple intrusion detection approaches 
can reinforce each other and provide more accurate re- 
sults [75, 67, 30]; e.g., misbehaving hosts that follow a 
similar misuse pattern and at the same time are detected 
to be part of the same botnet communication graph may 
be precisely labeled as a botnet, even if each individual 
misbehavior detection is not sufficient to provide a high- 
confidence categorization. 

A concrete example of how misuse detection may 
work is the following: we randomly sample nodes from 
the suspect P2P network and compute the likelihood of 
the sampled nodes being malicious, based on inputs from 
honeynets, spam blacklists etc. If we can identify a statis- 
tically significant difference of the rates of misuse, then 
we can assume that membership in the P2P network is 
correlated with misuse and we should label it as a P2P 
botnet. Note that, given the availability of large sample 
sizes, even a small difference in the rates will be statisti- 
cally significant, so this approach will be successful even 
if misuse detection fails to identify the vast majority of 
the botnet nodes as malicious. 
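
One way to test for such a difference is a one-sided two-proportion z-test, sketched below. The choice of test, the comparison against a background sample of hosts outside the suspect network, and the name misuse_rate_test are all assumptions made for illustration; the text above does not prescribe a particular statistic.

```python
from math import sqrt
from statistics import NormalDist


def misuse_rate_test(flagged_suspect, n_suspect, flagged_background, n_background):
    """One-sided two-proportion z-test (an assumed choice of test).

    Returns (z, p_value) for the hypothesis that sampled nodes of the
    suspect P2P network are flagged at a higher rate than background hosts.
    Assumes at least one host is flagged and at least one is not.
    """
    p1 = flagged_suspect / n_suspect
    p2 = flagged_background / n_background
    pooled = (flagged_suspect + flagged_background) / (n_suspect + n_background)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_suspect + 1.0 / n_background))
    z = (p1 - p2) / se
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value
```

With hypothetical counts, a 3% versus 2% flag rate is far from significant with 100 samples per group but highly significant with 10,000 per group, which matches the observation that large samples make even small rate differences detectable.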


Scale and cooperation: Our experiments show our de- 
sign can scale to large traffic volumes, and in the pres- 
ence of partial observations. However, several practi- 
cal issues remain. First, large ISPs tend to use sam- 
pled data analysis to monitor their networks. This can 
miss low-volume control communications used by botnet 
networks. New counter architectures or programmable 
monitoring techniques should be used to collect suffi- 
cient statistics to run our algorithms [73]. Also, for best 
results multiple vantage points should contribute data to 
obtain a better overall perspective. 


Tradeoffs between structure and detection: The com- 
munication structure of botnet graphs plays an important 
role in their delay penalty, and how resilient they are to 
network failures. At the same time, our results indicate 
that the structure of the communication graph has some
effect on the ability to detect the botnet host from a col- 
lection of vantage points. As part of future work, we plan 
to study the tradeoff between resilience and the ability to 
avoid detection, and whether there exist fundamentally 
hard-to-detect botnet structures that are also resilient. 


Containing botnets: The ability to quickly localize 
structured network topologies may assist existing sys- 
tems that monitor network traffic to quickly localize and 
contain bot-infected hosts. When botnets are detected 
in edge networks, the relevant machines are taken of- 
fline. However, this may not always be easy with in- 
core detection; an interesting question is whether in-core 
filtering or distributed blacklisting can be an effective re- 
sponse strategy when edge cooperation is not possible. 
Another question we plan to address is whether there ex- 
ist responses that do not completely disconnect a node 
but mitigate its potential malicious activities, to be ef- 
fected when a node is identified as a botnet member, but 
with a low confidence. 


7 Related Work 


The increasing criticality of the botnet threat has led to 
vast amounts of work that attempt to localize them. We 
can classify this work into host based approaches and 
network based approaches. Host based approaches detect 
intrusions by analyzing information available on a sin- 
gle host. On the other hand, network based approaches 
detect botnets by analyzing incoming and outgoing host 
traffic. Hybrid approaches exist as well. BotGrep (our 
work) is a network based approach to botnet detection 
that uses graph theory to detect botnets. 

In the following section (Section 7.1) we review re- 
lated work on network based approaches and then de- 
scribe work on botnet detection using graph analysis 
(Section 7.2). 


7.1 Network based approaches 


Several pieces of work isolate bot-infected hosts by de- 
tecting the malicious traffic they send, which may be 
divided into schemes that analyze attack traffic, and 
schemes that analyze control traffic. 


Attack traffic: For example, network operators may
treat sources of denial of service attacks, port scan-
ning, spam, and other unwanted traffic as likely bots.
These works focus on the symptoms caused by the bot- 
nets instead of the networks themselves. Several works 
seek to exploit DNS usage patterns. Dagon et al. [19] 
studied the propagation rates of malware released at dif- 
ferent times by redirecting DNS traffic for bot domain 
names. Their use of DNS sinkholes is useful in mea-
suring new deployments of a known botnet. However,
this approach requires a priori knowledge of botnet do- 
main names and negotiations with DNS operators and 
hence does not target scaling to networks where a bot- 
net can simply change domain names, have a large pool 
of C&C IP addresses and change the domain name gen- 
eration algorithm by remotely patching the bot. Subse- 
quently, Ramachandran et al. [61] use a graph based ap- 
proach to isolate spam botnets by analyzing the pattern 
of requests to DNS blacklists maintained by ISPs. They 
observed that legitimate email servers request blacklist 
lookups and are looked up by other email servers ac- 
cording to the timing pattern of email arrival, while bot- 
infected machines are a lot less likely to be looked up 
by legitimate email servers. However, DNS blacklists 
and phishing blacklists [65], while initially effective,
are becoming increasingly ineffective [60] owing to the
agility of the attackers. Much more recently, Villamar 
et al. [74] applied Bayesian methods to isolate central- 
ized botnets that use fast-flux to counter DNS blacklists, 
based on the similarity of their DNS traffic with a given 
corpus of known DNS botnet traces. Further, in order 
to study bots, Honeypot techniques have been widely 
used by researchers. Cooke et al. [17] conducted several 
studies of botnet propagation and dynamics using Hon- 
eypots; Barford and Yegneswaran [8] collected bot sam- 
ples and carried out a detailed study on the source code 
of several families; finally, Freiling et al. [24] and Rajab 
et al. [59] carried out measurement studies using Honey- 
pots. Collins et al. [16] present a novel botnet detection 
approach based on the tendency of unclean networks to
contain compromised hosts for extended periods of time
and hence to act as natural Honeypots for various bot-
nets. However, Honeypot-based approaches are limited
by their ability to attract botnets that depend on human 
action for an infection to take place, an increasingly pop- 
ular aspect of the attack vector [51]. 


Control traffic: Another direction of work is to local-
ize botnets solely based on the control traffic they use to
maintain their infrastructures. This line of work can be 
classified as traffic-signature based detection and statis- 
tical traffic analysis based detection. Techniques in the 
former category require traffic signatures to be developed 
for every botnet instance. This approach has been widely 
used in the detection of IRC-based botnets. Binkley and
Singh [10] combine IRC statistics and TCP work weight
to generate signatures; Karasaridis et al. [44] present an 
algorithm to detect IRC C&C traffic signatures using 
Netflow records; Rishi [27] uses n-gram analysis to iden- 
tify botnet nickname patterns. The limitations of these 
approaches are analogous to the scalability issues faced 
by host-based detection techniques. In addition, such 
signatures may not exist for P2P botnets. In the latter 
category, several works [31, 72, 9, 49] suggest that bot-
nets can be detected by analyzing their flow character-
istics. In all these approaches, the authors use a vari- 
ety of heuristics to characterize the network behavior of 
various applications and then apply clustering algorithms 
to isolate botnet traffic. These schemes assume that the 
statistical properties of bot traffic will be different from 
normal traffic because of synchronized or correlated be- 
havior between bots. While this behavior is currently 
somewhat characteristic of botnets, it can be easily mod-
ified by botnet authors. As such, it does not derive from
a fundamental property of botnets.

Other works use a hybrid approach, such as BotHunter [30],
which automates traffic-signature generation
by searching for a series of flows that match the infec- 
tion life-cycle of a bot; BotMiner [29] combines packet 
statistics of C&C traffic with those of attack traffic and 
then applies clustering techniques to heuristically isolate 
botnet flows. TAMD [76] is another method that ex- 
ploits the spatial and temporal characteristics of botnet 
traffic that emerges from multiple systems within a van- 
tage point. They aggregate flows based on similarity of 
flow sizes and host configuration (such as OS platforms) 
and compare them with a historical baseline to detect in- 
fected hosts. 

Finally, there are also schemes that combine network- 
and host-based approaches. The work of Stinson et 
al. [69] attempts to discriminate between locally-initiated 
versus remotely-initiated actions by tracking data arriv- 
ing over the network being used as system call arguments 
using taint tracking methods. Following a similar ap- 
proach, Gummadi et al. [33] whitelist application traf- 
fic by identifying and attesting human-generated traffic 
from a host which allows an application server to se- 
lectively respond to service requests. Finally, John et 
al. [40] present a technique to defend against spam bot- 
nets by automating the generation of spam feeds by di- 
recting an incoming spam feed into a Honeynet, then 
downloading bots spreading through those messages and 
then using the outbound spam generated to create a bet- 
ter feed. While all the above are interesting approaches 
they again deal with the side-effects of botnets instead of 
tackling the problem in its entirety in a scalable manner. 


7.2 Graph-based approaches 


Several works [15, 36, 35, 78, 38] have previously ap- 
plied graph analysis to detect botnets. The technique of 
Collins and Reiter [15] detects anomalies induced in a 
graph of protocol-specific flows by botnet control traf-
fic. They suggest that a botnet can be detected based on
the observation that an attacker will increase the number 
of connected graph components due to a sudden growth 
of edges between unlikely neighboring nodes. While it 
depends on being able to accurately model valid network 
growth, this is a powerful approach because it avoids de-
pending on protocol semantics or packet statistics. How-
ever, this work makes only minimal use of spatial re-
lationship information. Additionally, the need for his-
torical record keeping makes it challenging in scenar-
ios where the victim network is already infected when
it seeks help and hasn’t stored past traffic data, while our
scheme can be used to detect pre-existing botnets as well.
Iliofotou et al. [36, 35] also exploit dynamicity of traffic
graphs to classify network flows in order to detect P2P
networks. Their approach uses static (spatial) and dynamic (temporal)
metrics centered on node and edge level metrics in addi- 
tion to the largest-connected-component-size as a graph 
level metric. Our scheme however starts from first princi- 
ples (searching for expanders) and uses the full extent of 
spatial relationships to discover P2P graphs including the 
joint degree distribution and the joint-joint degree distri- 
bution and so on. 

Of the many botnet detection and mitigation tech- 
niques mentioned above, most are rather ad-hoc and 
only apply to specific scenarios of centralized botnets 
such as IRC/HTTP/FTP botnets, although studies [28] 
indicate that the centralized model is giving way to the 
P2P model. Of the techniques that do address P2P bot- 
nets, detection is again dependent on specifics regarding 
control traffic ports, network behavior of certain types 
of botnets, reverse engineering botnet protocols and so 
on, which limits the applicability of these techniques. 
Generic schemes such as BotMiner [29] and TAMD [76] 
using behavior based clustering are better off but need 
access to extensive flow information which can have le- 
gal and privacy implications. It is also important to think 
about possible defenses that botmasters can apply, the 
cost of these defenses, and how they might affect the
efficiency of detection. Shear and Nicol [64, 54] describe 
schemes to mask the statistical characteristics of real traf- 
fic by embedding it in synthetic, encrypted, cover traffic. 
The adoption of such schemes will only require minimal 
alterations to existing botnet architectures but can effec- 
tively defend against detection schemes that depend on 
packet level statistics including BotMiner and TAMD. 


8 Conclusion


The ability to localize structured communication graphs 
within network traffic could be a significant step forward 
in identifying bots or traffic that violates network policy. 
As a first step in this direction, we proposed BotGrep, an 
inference algorithm that identifies botnet hosts and links 
within network traffic traces. BotGrep works by search- 
ing for structured topologies, and separating them from 
the background communication graph. We give an ar- 
chitecture for a BotGrep network deployment as well as 
a privacy-preserving extension to simplify deployment 
across networks. While our techniques do not achieve
perfect accuracy, they achieve a low enough false posi- 
tive rate to be of substantial use, especially when com- 
bined with complementary techniques. There are sev- 
eral avenues of future work. First, performance of our 
approach may be improved by leveraging temporal in-
formation (observing how parts of the communica-
tion graph change over time) to assist in separating out
the botnet graph. In addition, it may be desirable to 
distinguish other peer-to-peer structure from other Inter- 
net background traffic, perhaps by observing more fine- 
grained properties of communication patterns. Finally, 
we do not attempt to address the challenging problem of 
botnet response. Future work may leverage our inferred 
botnet topologies by dropping crucial links to partition 
the botnet, based on the structure of the botnet graph. 
Abstract


Regular expression (RE) matching is a core component 
of deep packet inspection in modern networking and 
security devices. In this paper, we propose the first 
hardware-based RE matching approach that uses Ternary 
Content Addressable Memories (TCAMs), which are 
off-the-shelf chips and have been widely deployed in 
modern networking devices for packet classification. We 
propose three novel techniques to reduce TCAM space 
and improve RE matching speed: transition sharing, ta- 
ble consolidation, and variable striding. We tested our 
techniques on 8 real-world RE sets, and our results show 
that small TCAMs can be used to store large DFAs and 
achieve potentially high RE matching throughput. For
space, we were able to store each of the corresponding 8 
DFAs with as many as 25,000 states in a 0.59Mb TCAM 
chip where the number of TCAM bits required per DFA 
state were 12, 12, 12, 13, 14, 26, 28, and 42. Using 
a different TCAM encoding scheme that facilitates pro- 
cessing multiple characters per transition, we were able 
to achieve potential RE matching throughputs of between 
10 and 19 Gbps for each of the 8 DFAs using only a sin- 
gle 2.36 Mb TCAM chip. 


1 Introduction 
1.1 Background and Problem Statement 


Deep packet inspection is a key part of many networking 
devices on the Internet such as Network Intrusion De- 
tection (or Prevention) Systems (NIDS/NIPS), firewalls, 
and layer 7 switches. In the past, deep packet inspec- 
tion typically used string matching as a core operator, 
namely examining whether a packet’s payload matches 
any of a set of predefined strings. Today, deep packet in- 
spection typically uses regular expression (RE) matching 
as a core operator, namely examining whether a packet’s 
payload matches any of a set of predefined regular ex- 
pressions, because REs are fundamentally more expres- 
sive, efficient, and flexible in specifying attack signatures 
[27]. Most open source and commercial deep packet in-
spection engines such as Snort, Bro, TippingPoint X505,
and many Cisco networking appliances use RE match- 
ing. Likewise, some operating systems such as Cisco 
IOS and Linux have built RE matching into their layer 7 
filtering functions. As both traffic rates and signature set 
sizes are rapidly growing over time, fast and scalable RE 
matching is now a core network security issue. 

RE matching algorithms are typically based on the De-
terministic Finite Automata (DFA) representation of reg-
ular expressions. A DFA is a 5-tuple $(Q, \Sigma, \delta, q_0, A)$,
where $Q$ is a set of states, $\Sigma$ is an alphabet,
$\delta: \Sigma \times Q \to Q$ is the transition function,
$q_0$ is the start state, and
$A \subseteq Q$ is a set of accepting states. Any set of regu-
lar expressions can be converted into an equivalent DFA
with the minimum number of states. The fundamental
issue with DFA-based algorithms is the large amount of
memory required to store the transition table $\delta$. We have to
store $\delta(q, a) = p$ for each state $q$ and character $a$.
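
To illustrate why the transition table dominates memory, here is a minimal Python sketch of table-driven DFA matching with one stored entry per (state, character) pair. The three-state DFA, which accepts strings over {a, b} containing "ab", and the names TABLE and run_dfa are hypothetical examples rather than anything taken from the RE sets studied here.

```python
# Hypothetical DFA over the alphabet {"a", "b"} that accepts any input
# containing the substring "ab". State 0 is the start state and state 2
# is the only accepting state.
TABLE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}


def run_dfa(table, start, accepting, data):
    """Follow one stored transition per input character."""
    q = start
    for a in data:
        q = table[(q, a)]  # one lookup of delta(q, a) = p
    return q in accepting
```

The table holds |Q| x |Sigma| entries, 6 in this toy case. With 8-bit characters each state needs 256 entries, which is the cost the techniques in this paper aim to reduce.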

Prior RE matching algorithms are either software-
based [4, 6, 7, 12, 16, 18, 19] or FPGA-based [5, 7, 13, 14,
22, 24, 29]. Software-based solutions have to be imple-
mented in customized ASIC chips to achieve high speed;
the limitations of this approach include high deployment cost
and being hard-wired to a specific solution, and thus a lim-
ited ability to adapt to new RE matching solutions. Al-
though FPGA-based solutions can be modified, resynthe-
sizing and updating FPGA circuitry in a deployed system
to handle regular expression updates is slow and diffi-
cult; this makes FPGA-based solutions difficult to de-
ploy in many networking devices (such as NIDS/NIPS
and firewalls) where the regular expressions need to be
updated frequently [18].


1.2 Our Approach 


To address the limitations of prior art on high-speed RE 
matching, we propose the first Ternary Content Address- 
able Memory (TCAM) based RE matching solution. We 
use a TCAM and its associated SRAM to encode the
transitions of the DFA built from an RE set where one 
TCAM entry might encode multiple DFA transitions. 

TCAM entries and lookup keys are encoded in ternary
as 0’s, 1’s, and *’s where *’s stand for either 0 or 1.
A lookup key matches a TCAM entry if and only if
the corresponding 0’s and 1’s match; for example, key
0001101111 matches entry 000110****. TCAM circuits
compare a lookup key with all its occupied entries in par-
allel and return the index (or sometimes the content) of
the first address for the content that the key matches; this
address is then used to retrieve the corresponding deci-
sion in SRAM.

Given an RE set, we first construct an equivalent min-
imum state DFA [15]. Second, we build a two column
TCAM lookup table where each column encodes one of
the two inputs to $\delta$: the source state ID and the input char-
acter. Third, for each TCAM entry, we store the destina-
tion state ID in the same entry of the associated SRAM.
Fig. 1 shows an example DFA, its TCAM lookup table,
and its SRAM decision table. We illustrate how this DFA
processes the input stream “01101111, 01100011”. We
form a TCAM lookup key by appending the current input
character to the current source state ID; in this example,
we append the first input character “01101111” to “00”,
the ID of the initial state $s_0$, to form “0001101111”. The
first matching entry is the second TCAM entry, so “01”,
the destination state ID stored in the second SRAM en-
try is returned. We form the next TCAM lookup key
“0101100011” by appending the second input character
“01100011” to this returned state ID “01”, and the pro-
cess repeats.
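
The first lookup in this walk-through can be reproduced with a small software model of first-match ternary lookup. This is a sketch under stated assumptions: a real TCAM compares all entries in parallel rather than looping, the names ternary_match and tcam_lookup are hypothetical, and the table contains only the two entries for source state 00 quoted in the example rather than a complete DFA.

```python
def ternary_match(key, entry):
    """True if every non-* position of the entry equals the key bit."""
    return len(key) == len(entry) and all(
        e in ("*", k) for k, e in zip(key, entry)
    )


def tcam_lookup(tcam, sram, key):
    """Return the SRAM decision for the first matching TCAM entry.

    Software stand-in for the parallel comparison done in hardware.
    """
    for index, entry in enumerate(tcam):
        if ternary_match(key, entry):
            return sram[index]
    return None  # no entry matches


# Only the two entries for source state "00" from the example above.
TCAM = ["00" + "01100000", "00" + "0110****"]
SRAM = ["00", "01"]

# Key = current state ID followed by the input character.
next_state = tcam_lookup(TCAM, SRAM, "00" + "01101111")
```

Here next_state is "01", the decision of the second entry. Entry order matters: the key 0001100000 matches both entries, and first-match semantics select the more specific first one.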

TCAM                                SRAM
Src ID   Input character            Dst ID

00       0110 0000                  00
00       0110 ****                  01
         0110 0000                  00
         0110 0010                  01
         0110 0000                  00
         0110 001*                  01




Figure 1: A DFA with its TCAM table 


Advantages of TCAM-based RE Matching There 
are three key reasons why TCAM-based RE matching 
works well. First, a small TCAM is capable of encoding 
a large DFA with carefully designed algorithms lever- 
aging the ternary nature and first-match semantics of 
TCAMs. Our experimental results show that each of the 
DFAs built from 8 real-world RE sets with as many as 
25,000 states, 4 of which were obtained from the authors
of [6], can be stored in a 0.59Mb TCAM chip. The two 
DFAs that correspond to primarily string matching RE 
sets require 28 and 42 TCAM bits per DFA state; 5 of 
the remaining 6 DFAs which have a sizeable number of 
‘.*’ patterns require 12 to 14 TCAM bits per DFA state 
whereas the 6th DFA requires 26 TCAM bits per DFA 
state. Second, TCAMs facilitate high-speed RE matching 
because TCAMs are essentially high-performance paral-
lel lookup systems: any lookup takes constant time (i.e.,
a few CPU cycles) regardless of the number of occupied
entries. Using Agrawal and Sherwood’s TCAM model 
[1] and the resulting required TCAM sizes for the 8 RE 
sets, we show that it may be possible to achieve through- 
puts ranging between 5.36 and 18.6 Gbps using only a 
single 2.36 Mb TCAM chip. Third, because TCAMs are 
off-the-shelf chips that are widely deployed in modern 
networking devices, it should be easy to design network- 
ing devices that include our TCAM based RE matching 
solution. It may even be possible to immediately deploy 
our solution on some existing devices. 


Technical Challenges There are two key technical 
challenges in TCAM-based RE matching. The first is en- 
coding a large DFA in a small TCAM. Directly encoding 
a DFA in a TCAM using one TCAM entry per transi- 
tion will lead to a prohibitive amount of TCAM space. 
For example, consider a DFA with 25000 states that con- 
sumes one 8 bit character per transition. We would need 
a total of 140.38 Mb ($= 25000 \times 2^8 \times (8 + \lceil \log 25000 \rceil)$).
This is infeasible given the largest available TCAM chip 
has a capacity of only 72 Mb. To address this challenge, 
we use two techniques that minimize the TCAM space 
for storing a DFA: transition sharing and table consol- 
idation. The second challenge is improving RE match- 
ing speed and thus throughput. One way to improve the 
throughput by up to a factor of k is to use k-stride DFAs 
that consume k input characters per transition. However,
this leads to an exponential increase in both state and 
transition spaces. To avoid this space explosion, we use 
the novel idea of variable striding. 
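
The direct-encoding estimate above can be checked with a few lines of arithmetic. The sketch assumes a base-2 logarithm for the state ID width and 1 Mb = 2^20 bits, which together reproduce the 140.38 Mb figure; the function name is hypothetical.

```python
from math import ceil, log2


def direct_encoding_bits(num_states, char_bits=8):
    """TCAM bits needed with one entry per (state, character) transition."""
    state_bits = ceil(log2(num_states))     # width of a source state ID
    entries = num_states * 2 ** char_bits   # one entry per transition
    return entries * (char_bits + state_bits)


bits = direct_encoding_bits(25000)  # 25000 * 256 * (8 + 15)
megabits = bits / 2 ** 20           # assumes 1 Mb = 2**20 bits
```

For 25000 states this gives 147,200,000 bits, or about 140.38 Mb, roughly twice the 72 Mb capacity cited above.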


Key Idea 1 - Transition Sharing The basic idea is to 
combine multiple transitions into one TCAM entry by 
exploiting two properties of DFA transitions: (1) char- 
acter redundancy where many transitions share the same 
source state and destination state and differ only in their 
character label, and (2) state redundancy where many 
transitions share the same character label and destina- 
tion state and differ only in their source state. One rea- 
son for the pervasive character and state redundancy in 
DFAs constructed from real-world RE sets is that most 
states have most of their outgoing transitions going to 
some common “failure” state; such transitions are often 
called default transitions. The low entropy of these DFAs 
opens optimization opportunities. We exploit character
redundancy by character bundling (i.e., input character 
sharing) and state redundancy by shadow encoding (i.e., 
source state sharing). In character bundling, we use a 
ternary encoding of the input character field to repre- 
sent multiple characters and thus multiple transitions that 
share the same source and destination states. In shadow 
encoding, we use a ternary encoding for the source state 
ID to represent multiple source states and thus multiple 
transitions that share the same label and destination state. 
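
To illustrate how a single ternary field stands for several concrete values, the sketch below expands a ternary pattern into the bit strings it covers. It shows only the expansion direction; choosing which characters or state IDs to group into one pattern is the subject of the later sections and is not attempted here. The function name expand is an assumption.

```python
from itertools import product


def expand(pattern):
    """List every concrete bit string matched by a ternary pattern."""
    choices = [("0", "1") if c == "*" else (c,) for c in pattern]
    return ["".join(bits) for bits in product(*choices)]


# One character field with four *'s bundles 16 input characters,
# so one TCAM entry can replace 16 single-character transitions.
bundled = expand("0110****")
```

The same expansion applies to the source state ID field: a pattern such as 1* in a two-bit state field would cover states 10 and 11, which is the kind of sharing shadow encoding relies on.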


Key Idea 2 - Table Consolidation The basic idea is 
to merge multiple transition tables into one transition 
table using the observation that some transition tables 
share similar structures (e.g., common entries) even if 
they have different decisions. This shared structure can 
be exploited by consolidating similar transition tables 
into one consolidated transition table. When we con- 
solidate k TCAM lookup tables into one consolidated 
TCAM lookup table, we store k decisions in the asso- 
ciated SRAM decision table. 


Key Idea 3 - Variable Striding The basic idea is to 
store transitions with a variety of strides in the TCAM so 
that we increase the average number of characters con- 
sumed per transition while ensuring all the transitions fit 
within the allocated TCAM space. This idea is based on 
two key observations. First, for many states, we can cap- 
ture many but not all k-stride transitions using relatively 
few TCAM entries whereas capturing all k-stride tran- 
sitions requires prohibitively many TCAM entries. Sec- 
ond, with TCAMs, we can store transitions with different 
strides in the same TCAM lookup table. 

The rest of this paper proceeds as follows. We review 
related work in Section 2. In Sections 3, 4, and 5, we 
describe transition sharing, table consolidation, and vari- 
able striding, respectively. We present implementation 
issues, experimental results, and conclusions in Sections 
6, 7, and 8, respectively. 


2 Related Work 


In the past, deep packet inspection typically used string 
matching (often called pattern matching) as a core op- 
erator; string matching solutions have been extensively 
studied [2, 3, 28, 30, 32, 33, 35]. TCAM-based solutions
have been proposed for string matching, but they do not 
generalize to RE matching because they only deal with 
independent strings [3, 30, 35]. 

Today deep packet inspection often uses RE match- 
ing as a core operator because strings are no longer ad- 
equate to precisely describe attack signatures [25, 27]. 
Prior work on RE matching falls into two categories: 
software-based and FPGA-based. Prior software-based 
RE matching solutions focus on either reducing mem-
ory by minimizing the number of transitions/states or
improving speed by increasing the number of characters 
per lookup. Such solutions can be implemented on gen- 
eral purpose processors, but customized ASIC chip im- 
plementations are needed for high speed performance. 
For transition minimization, two basic approaches have 
been proposed: alphabet encoding that exploits charac- 
ter redundancy [6, 7, 12, 16] and default transitions that 
exploit state redundancy [4, 6, 18, 19]. Previous alphabet 
encoding approaches cannot fully exploit local charac- 
ter redundancy specific to each state. Most use a sin- 
gle alphabet encoding table that can only exploit global 
character redundancy that applies to every state. Kong 
et al. proposed using 8 alphabet encoding tables by par- 
titioning the DFA states into 8 groups with each group 
having its own alphabet encoding table [16]. Our work 
improves upon previous alphabet encoding techniques 
because we can exploit local character redundancy spe- 
cific to each state. Our work improves upon the default 
transition work because we do not need to worry about 
the number of default transitions that a lookup may go 
through because TCAMs allow us to traverse an arbitrar- 
ily long default transition path in a single lookup. Some 
transition sharing ideas have been used in some TCAM- 
based string matching solutions for Aho-Corasick-based 
DFAs [3, 11]. However, these ideas do not easily ex- 
tend to DFAs generated by general RE sets, and our 
techniques produce at least as much transition sharing 
when restricted to string matching DFAs. For state min- 
imization, two fundamental approaches have been pro- 
posed. One approach is to first partition REs into multi- 
ple groups and build a DFA from each group; at run time, 
packet payload needs to be scanned by multiple DFAs 
[5, 26, 34]. This approach is orthogonal to our work and 
can be used in combination with our techniques. In par- 
ticular, because our techniques achieve greater compres- 
sion of DFAs than previous software-based techniques, 
less partitioning of REs will be required. The other ap- 
proach is to use scratch memory to store variables that 
track the traversal history and avoid some duplication of 
states [8,17,25]. The benefit of state reduction for scratch 
memory-based FAs does not come for free. The size of 
the required scratch memory may be significant, and the 
time required to update the scratch memory after each 
transition may be significant. This approach is orthogonal
to our approach. While we have only applied our
techniques to DFAs in this initial study of TCAM-based 
RE matching, our techniques may work very well with 
scratch memory-based automata. 


Prior FPGA-based solutions exploit the parallel pro- 
cessing capabilities of FPGA technology to implement 
nondeterministic finite automata (NFA) [5, 7, 13, 14, 22, 
24, 29] or parallel DFAs [23]. While NFAs are more compact
than DFAs, they require more memory bandwidth
to process each transition as an NFA may be in multiple
states whereas a DFA is always only in one state. Thus, 
each character that is processed might be processed in 
up to |Q| transition tables. Prior work has looked at 
ways for finding good NFA representations of the REs 
that limit the number of states that need to be processed 
simultaneously. However, FPGAs cannot be quickly
reconfigured, and they have clock speeds that are slower
than ASIC chips. 

There has been work [7, 12] on creating multi-stride 
DFAs and NFAs. This work primarily applies to FPGA 
NFA implementations since multiple-character SRAM-based
DFAs have only been evaluated for a small number
of REs. The ability to increase stride has been limited 
by the constraint that all transitions must be increased 
in stride; this leads to excessive memory explosion for 
strides larger than 2. With variable striding, we increase 
stride selectively on a state by state basis. Alicherry et al. 
have explored variable striding for TCAM-based string 
matching solutions [3] but not for DFAs that apply to ar- 
bitrary RE sets. 


3 Transition Sharing 


The basic idea of transition sharing is to combine mul- 
tiple transitions into a single TCAM entry. We pro- 
pose two transition sharing ideas: character bundling and 
shadow encoding. Character bundling exploits intra-state 
optimization opportunities and minimizes TCAM tables 
along the input character dimension. Shadow encoding 
exploits inter-state optimization opportunities and mini- 
mizes TCAM tables along the source state dimension. 


3.1 Character Bundling 


Character bundling exploits character redundancy by 
combining multiple transitions from the same source 
state to the same destination into one TCAM entry. Char- 
acter bundling consists of four steps. (1) Assign each 
state a unique ID of ⌈log |Q|⌉ bits. (2) For each state,
enumerate all 256 transition rules where for each rule,
the predicate is a transition’s label and the decision is the
destination state ID. (3) For each state, treating the 256 
rules as a 1-dimensional packet classifier and leveraging 
the ternary nature and first-match semantics of TCAMs, 
we minimize the number of transitions using the op- 
timal 1-dimensional TCAM minimization algorithm in 
[20,31]. (4) Concatenate the |Q| 1-dimensional minimal 
prefix classifiers together by prepending each rule with 
its source state ID. The resulting list can be viewed as a 
2-dimensional classifier where the two fields are source 
state ID and transition label and the decision is the des- 
tination state ID. Fig. 1 shows an example DFA and its 
TCAM lookup table built using character bundling. The
three chunks of TCAM entries encode the 256 transitions
for s0, s1, and s2, respectively. Without character
bundling, we would need 256 × 3 entries.
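As an illustration of character bundling, the following Python sketch emulates first-match ternary lookup for a single state. The transition function used here (characters a through o lead to s1, everything else to s0) is a hypothetical example and is not claimed to be the DFA of Fig. 1. The three-entry table was written by hand; it is not the output of the minimization algorithm in [20, 31].

```python
def ternary_match(pattern: str, key: str) -> bool:
    """True if every non-'*' position of the ternary pattern equals the key bit."""
    return all(p in ("*", k) for p, k in zip(pattern, key))

def first_match(table, key):
    """Return the decision of the first entry whose pattern matches the key."""
    for pattern, decision in table:
        if ternary_match(pattern, key):
            return decision
    raise KeyError(key)

# Hypothetical state: characters 'a'..'o' (0x61..0x6F) lead to s1, all others to s0.
def delta(c: int) -> str:
    return "s1" if ord("a") <= c <= ord("o") else "s0"

# Hand-written bundled table: three ternary entries instead of 256 exact ones.
bundled = [
    ("01100000", "s0"),  # 0x60 is carved out of the 0110**** block first
    ("0110****", "s1"),  # remaining characters of the block: 'a'..'o'
    ("********", "s0"),  # everything else
]

# Characters on which the bundled table disagrees with the full transition function.
mismatches = [c for c in range(256)
              if first_match(bundled, format(c, "08b")) != delta(c)]
```

The first entry relies on first-match semantics: it shadows one character of the broader 0110**** entry, which is what lets a range that is not a prefix be encoded in so few entries.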


3.2 Shadow Encoding 


Whereas character bundling uses ternary codes in the in- 
put character field to encode multiple input characters, 
shadow encoding uses ternary codes in the source state 
ID field to encode multiple source states. 


3.2.1 Observations 


We use our running example in Fig. 1 to illustrate shadow
encoding. We observe that all transitions with source
states s1 and s2 have the same destination state except
for the transitions on character c. Likewise, source state
s0 differs from source states s1 and s2 only in the character
range [a, o]. This implies there is a lot of state redundancy.
The table in Fig. 2 shows how we can exploit
state redundancy to further reduce required TCAM
space. First, since states s1 and s2 are more similar, we
give them the state IDs 00 and 01, respectively. State
s2 uses the ternary code of 0* in the state ID field of its
TCAM entries to share transitions with state s1. We give
state s0 the state ID of 10, and it uses the ternary code of
** in the state ID field of its TCAM entries to share transitions
with both states s1 and s2. Second, we order the
state tables in the TCAM so that state s1 is first, state s2
is second, and state s0 is last. This facilitates the sharing
of transitions among different states where earlier states
have incomplete tables deferring some transitions to later
tables.


[Table: TCAM column "Src State ID" and SRAM column "Dest State ID"]


Figure 2: TCAM table with shadow encoding 
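The lookup behavior of a shadow-encoded table can be sketched in Python as follows. The state IDs, ternary codes, and table order follow the text (s1 = 00, s2 = 01 with code 0*, s0 = 10 with code **). The input patterns and destination states of the seven entries are assumptions made for illustration, chosen to be consistent with the observations above.

```python
def ternary_match(pattern, key):
    """True if every non-'*' position of the pattern equals the key bit."""
    return all(p in ("*", k) for p, k in zip(pattern, key))

# (source-state field, input-character field, destination state ID)
TCAM = [
    ("00", "01100011", "01"),  # s1's table: only 'c' differs from s2
    ("0*", "0110001*", "00"),  # s2's table, also matched by s1 via code 0*
    ("0*", "01100000", "10"),
    ("0*", "0110****", "01"),
    ("**", "01100000", "10"),  # s0's table, matched by every state via code **
    ("**", "0110****", "00"),
    ("**", "********", "10"),
]

def next_state(state_id: str, char: str) -> str:
    """One TCAM lookup: the key is the current state ID followed by the character."""
    key = state_id + format(ord(char), "08b")
    for src, inp, dest in TCAM:
        if ternary_match(src + inp, key):
            return dest
    raise KeyError(key)
```

A lookup for state s1 on a character outside the 0110**** block, such as z, falls through the first four entries and is resolved by s0's table in the same single lookup, which is the sense in which deferred transitions cost no extra time.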


We must solve three problems to implement shadow 
encoding: (1) Find the best order of the state tables in 
the TCAM given that any order is allowed. (2) Identify 
entries to remove from each state table given this order. 
(3) Choose binary IDs and ternary codes for each state 
that support the given order and removed entries. We 
solve these problems in the rest of this section. 

Our shadow encoding technique builds upon prior 
work with default transitions [4, 6, 18, 19] by exploiting 
the same state redundancy observation and using their
concepts of default transitions and Delayed input DFAs
(D²FA). However, our final technical solutions are different
because we work with TCAM whereas prior techniques
work with RAM. For example, the concept of a
ternary state code has no meaning when working with 
RAM. The key advantage of shadow encoding in TCAM 
over prior default transition techniques is speed. Specif- 
ically, shadow encoding incurs no delay while prior de- 
fault transition techniques incur significant delay because 
a DFA may have to traverse multiple default transitions 
before consuming an input character. 


3.2.2 Determining Table Order 


We first describe how we compute the order of tables 
within the TCAM. We use some concepts such as default 
transitions and D²FA that were originally defined by Kumar
et al. [18] and subsequently refined [4, 6, 19].


Figure 3: (a) D²FA, (b) SRG, and (c) deferment tree


A D²FA is a DFA with default transitions where each
state p can have at most one default transition to one
other state q in the D²FA. In a legal D²FA, the directed
graph consisting of only default transitions must
be acyclic; we call this graph a deferment forest. It is a
forest rather than a tree since more than one node may
not have a default transition. We call a tree in a deferment
forest a deferment tree.

We determine the order of state tables in TCAM by
constructing a deferment forest and then using the partial
order defined by the deferment forest. Specifically, if
there is a directed path from state p to state q in the deferment
forest, we say that state p defers to state q, denoted
p > q. If p > q, we say that state p is in state q’s shadow.
We use the partial order of a deferment forest to determine
the order of state transition tables in the TCAM.
Specifically, state q’s transition table must be placed after
the transition tables of all states in state q’s shadow.

We compute a deferment forest that minimizes the 
TCAM representation of the resulting D²FA as follows.
Our algorithm builds upon algorithms from prior work 
[4, 6, 18, 19], but there are several key differences. First, 
unlike prior work, we do not pay a speed penalty for long 
default transition paths. Thus, we achieve better transition
sharing than prior work. Second, to maximize the
potential gains from our variable striding technique de- 
scribed in Section 5 and table consolidation, we choose 
states that have lots of self-loops to be the roots of our 
deferment trees. Prior work has typically chosen roots 
in order to minimize the distance from a leaf node to a 
root, though Becchi and Crowley do consider related criteria
when constructing their D²FA [6]. Third, we explicitly
ignore transition sharing between states that have
few transitions in common. This has been done implic- 
itly in the past, but we show how doing so leads to better 
results when we use table consolidation. 


The algorithm for constructing deferment forests con- 
sists of four steps. First, we construct a Space Reduction 
Graph (SRG), which was proposed in [18], from a given 
DFA. Given a DFA with |Q| states, an SRG is a clique 
with |Q| vertices each representing a distinct state. The 
weight of each edge is the number of common (outgoing) 
transitions between the two connected states. Second, 
we trim away edges with small weight from the SRG. In 
our experiments, we use a cutoff of 10. We justify this 
step based on the following observations. A key property 
of SRGs that we observed in our experiments is that the 
weight distribution is bimodal: an edge weight is typ- 
ically either very small (< 10) or very large (> 180). 
If we use these low weight edges for default transitions, 
the resulting TCAM often has more entries. Plus, we 
get fewer deferment trees which hinders our table con- 
solidation technique (Section 4). Third, we compute a 
deferment forest by running Kruskal’s algorithm to find 
a maximum weight spanning forest. Fourth, for each deferment
tree, we pick the state that has the largest number of
transitions going back to itself as the root. Fig. 3(b) and 
(c) show the SRG and the deferment tree, respectively, 
for the DFA in Fig. 1. 
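The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: edges with weight of at least the cutoff are kept (the exact boundary is an assumption), Kruskal's algorithm is run with a plain union-find, and the tie-breaking strategy is not implemented. The three-state DFA at the end is hypothetical, chosen to agree with the observations made about the running example.

```python
def build_deferment_forest(delta, cutoff=10):
    """delta[s] lists the 256 destination states of state s (states are 0..n-1).
    Returns (parent, roots); parent[s] is the state s defers to, or None for a root."""
    n = len(delta)
    # Step 1: SRG edge weight = number of characters on which two states agree.
    # Step 2: edges below the cutoff are dropped.
    edges = []
    for p in range(n):
        for q in range(p + 1, n):
            w = sum(a == b for a, b in zip(delta[p], delta[q]))
            if w >= cutoff:
                edges.append((w, p, q))
    # Step 3: Kruskal's algorithm, heaviest edges first, gives a maximum weight forest.
    comp = list(range(n))
    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x
    adj = {s: [] for s in range(n)}
    for w, p, q in sorted(edges, reverse=True):
        rp, rq = find(p), find(q)
        if rp != rq:
            comp[rp] = rq
            adj[p].append(q)
            adj[q].append(p)
    # Step 4: in each tree the state with the most self-loops becomes the root,
    # and every other state defers to its neighbor on the path toward the root.
    self_loops = [sum(d == s for d in delta[s]) for s in range(n)]
    trees = {}
    for s in range(n):
        trees.setdefault(find(s), []).append(s)
    parent, roots = [None] * n, []
    for members in trees.values():
        root = max(members, key=lambda s: self_loops[s])
        roots.append(root)
        stack, seen = [root], {root}
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    parent[v] = u
                    stack.append(v)
    return parent, roots

# Hypothetical 3-state DFA (0 = s0, 1 = s1, 2 = s2) over 8-bit characters.
def row(s):
    out = []
    for c in range(256):
        if not ord("a") <= c <= ord("o"):
            out.append(0)                                      # outside [a, o]: go to s0
        elif s == 0:
            out.append(1)                                      # s0: [a, o] -> s1
        elif s == 1:
            out.append(1 if c == ord("b") else 2)              # s1: b -> s1, else s2
        else:
            out.append(1 if c in (ord("b"), ord("c")) else 2)  # s2: b, c -> s1
    return out

parent, roots = build_deferment_forest([row(0), row(1), row(2)])
```

For this hypothetical DFA the SRG weights are 242 (s0, s1), 243 (s0, s2), and 255 (s1, s2), so the forest is the chain in which s1 defers to s2 and s2 defers to s0, with the self-looping state s0 as root.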

We make the following key observation about the root 
states in our deferment trees. In most deferment trees, 
more than 128 (i.e., half) of the root state’s outgoing tran- 
sitions lead back to the root state; we call such a state a 
self-looping state. Based on the pigeonhole principle and 
the observed bimodal distribution, each deferment tree 
can have at most one self-looping state, and it is clearly 
the root state. We choose self-looping states as roots to 
improve the effectiveness of variable striding which we 
describe in Section 5. Intuitively, we have a very space- 
efficient method, self-loop unrolling, for increasing the 
stride of self-looping root states. The resulting increase 
in stride applies to all states that defer transitions to this 
self-looping root state. 

When we apply Kruskal’s algorithm, we use a tie 
breaking strategy because many edges have the same 
weight. To have most deferment trees centered around 
a self-looping state, we give priority to edges that have 
the self-looping state as one endpoint. If we still have a 


19th USENIX Security Symposium —=_115 




tie, we favor edges by the total number of edges in the
current spanning tree that both endpoints are connected
to, which prioritizes nodes that are already well connected.


3.2.3 Choosing Transitions 


For a given DFA and a corresponding deferment forest,
we construct a D²FA as follows. If state p has a default
transition to state q, we remove any transitions that are
common to both p’s transition table and q’s transition table
from p’s transition table. We denote the default transition
in the D²FA with a dashed arrow labeled with defer.
Fig. 3(a) shows the D²FA for the DFA in Fig. 1 given
the corresponding deferment forest (a deferment tree in
this case) in Fig. 3(c). We now compute the TCAM
entries for each transition table.

(1) For each state, enumerate all individual transition 
rules except the deferred transitions. For each transition 
rule, the predicate is the label of the transition and the 
decision is the state ID of the destination state. For now, 
we just ensure each state has a unique state ID. Thus, we 
get an incomplete 1-dimensional classifier for each state. 
(2) For each state, we minimize its transition table using 
the 1-dimensional incomplete classifier minimization al- 
gorithm in [21]. This algorithm works by first adding a 
default rule with a unique decision that has weight larger 
than the size of the domain, then applying the weighted 
one-dimensional TCAM minimization algorithm in [20] 
to the resulting complete classifier, and finally removing
the default rule, which is guaranteed to remain the default 
rule in the minimal complete classifier due to its huge 
weight. In our solution, the character bundling technique 
is used in this step. We also consider some optimizations 
where we specify some deferred transitions to reduce the 
total number of TCAM entries. For example, the second 
entry in s2’s table in Fig. 2 is actually a deferred transition
to state s0’s table, but not using it would result in 4
TCAM entries to specify the transitions that s2 does not
share with s0.


3.2.4 Shadow Encoding Algorithm 


To ensure that proper sharing of transitions occurs, we 
need to encode the source state IDs of the TCAM entries
according to the following shadow encoding scheme. 
Each state is assigned a binary state ID and a ternary 
shadow code. State IDs are used in the decisions of tran- 
sition rules. Shadow codes are used in the source state 
ID field of transition rules. In a valid assignment, every 
state ID and shadow code must have the same number of 
bits, which we call the shadow length of the assignment. 
For each state p, we use ID(p) and SC(p) to denote the
state ID and shadow code of p. A valid assignment of 
state IDs and shadow codes for a deferment forest must 
satisfy the following four shadow encoding properties:

1. Uniqueness Property: For any two distinct states p
and q, ID(p) ≠ ID(q) and SC(p) ≠ SC(q).


2. Self-Matching Property: For any state p, ID(p) ∈
SC(p) (i.e., ID(p) matches SC(p)).


3. Deferment Property: For any two states p and q, p >
q (i.e., q is an ancestor of p in the given deferment
tree) if and only if SC(p) ⊂ SC(q).


4. Non-interception Property: For any two distinct
states p and q, p > q if and only if ID(p) ∈ SC(q).


Intuitively, q’s shadow code must include the state ID of
all states in q’s shadow and cannot include the state ID
of any states not in q’s shadow.
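The four properties can be checked mechanically. The Python sketch below treats a ternary code as the set of binary strings it matches and tests each property directly. The function names and the `parent` map (each state mapped to the state it defers to) are illustrative choices, and the example assignment is the one used for s0, s1, and s2 in Section 3.2.1, assuming the deferment chain s1 > s2 > s0.

```python
from itertools import product

def expand(code):
    """Set of binary strings matched by a ternary code."""
    return {"".join(bits)
            for bits in product(*[("01" if ch == "*" else ch) for ch in code])}

def defers(p, q, parent):
    """True if following default transitions from p eventually reaches q (p != q)."""
    p = parent.get(p)
    while p is not None:
        if p == q:
            return True
        p = parent.get(p)
    return False

def is_valid(ids, codes, parent):
    """Check the four shadow encoding properties for every pair of states."""
    for p in ids:
        if ids[p] not in expand(codes[p]):                    # self-matching
            return False
        for q in ids:
            if p == q:
                continue
            if ids[p] == ids[q] or codes[p] == codes[q]:      # uniqueness
                return False
            d = defers(p, q, parent)
            if d != (expand(codes[p]) < expand(codes[q])):    # deferment
                return False
            if d != (ids[p] in expand(codes[q])):             # non-interception
                return False
    return True

ids = {"s1": "00", "s2": "01", "s0": "10"}
codes = {"s1": "00", "s2": "0*", "s0": "**"}
parent = {"s1": "s2", "s2": "s0"}
ok = is_valid(ids, codes, parent)
broken = is_valid(ids, dict(codes, s2="01"), parent)   # s2 no longer covers s1
```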

We give an algorithm for computing a valid assign- 
ment of state IDs and shadow codes for each state given 
a single deferment tree DT. We handle deferment forests
by simply creating a virtual root node whose children are 
the roots of the deferment trees in the forest and then run- 
ning the algorithm on this tree. In the following, we refer 
to states as nodes. 

Our algorithm uses the following internal variables for 
each node v: a local binary ID denoted L(v), a global 
binary ID denoted G(v), and an integer weight denoted 
W(v) that is the shadow length we would use for the 
subtree of DT rooted at v. Intuitively, the state ID of
v will be G(v)|L(v) where | denotes concatenation, and
the shadow code of v will be the prefix string G(v) followed
by the required number of *’s; some extra padding
characters may be needed. We use #L(v) and #G(v) to
denote the number of bits in L(v) and G(v), respectively. 

Our algorithm processes nodes in a bottom-up fashion. 
For each node v, we initially set L(v) = G(v) = 0 and 
W(v) = 0. Each leaf node of DT is now processed, 
which we denote by marking them red. We process an 
internal node v when all its children v_1, ..., v_n are red.
Once a node v is processed, its weight W (v) and its local 
ID L(v) are fixed, but we will prepend additional bits to 
its global ID G(v) when we process its ancestors in DT. 

We assign v and each of its children a variable-length 
binary code, which we call HCode. The HCode provides 
a unique signature that uniquely distinguishes each of the 
n + 1 nodes from each other while satisfying the four re- 
quired shadow code properties. One option would be to 
simply use lg(n + 1) bits and assign each node a binary 
number from 0 to n. However, to minimize the shadow 
code length W(v), we use a Huffman coding style algorithm
instead to compute the HCodes and W(v). This
algorithm uses two data structures: a binary encoding
tree T with n + 1 leaf nodes, one for v and each of its
children, and a min-priority queue, initialized with n + 1 
elements, one for v and each of its children, that is or- 
dered by node weight. While the priority queue has more 
than one element, we remove the two elements x and y 
with lowest weight from the priority queue, create a new
internal node z in T with two children x and y and set
weight(z) = maximum(weight(x), weight(y)) + 1, and then
put element z into the priority queue. When there is only
a single element in the priority queue, the binary encoding
tree T is complete. The HCode assigned to each leaf
node v’ is the path in T from the root node to v’ where
left edges have value 0 and right edges have value 1. We
update the internal variables of v and its descendants in
DT as follows. We set L(v) to be its HCode, and W(v)
to be the weight of the root node of T; G(v) is left empty.
For each child v_i, we prepend v_i’s HCode to the global
ID of every node in the subtree rooted at v_i including v_i
itself. We then mark v as red. This continues until all
nodes are red.

Figure 4: Shadow encoding example

We now assign each node a state ID and a shadow
code. First, we set the shadow length to be k, the weight
of the root node of DT. We use {*}^m to denote a ternary
string with m number of *’s and {0}^m to denote a binary
string with m number of 0’s. For each node v,
we compute v’s state ID and shadow code as follows:
ID(v) = G(v)|L(v)|{0}^(k−#G(v)−#L(v)), SC(v) =
G(v)|{*}^(k−#G(v)). We illustrate our shadow encoding
algorithm in Figure 4. Figure 4(a) shows all the internal
variables just before v_1 is processed. Figure 4(b)
shows the Huffman style binary encoding tree T built
for node v_1 and its children v_2, v_3, and v_4 and the resulting
HCodes. Figure 4(c) shows each node’s final weight,
global ID, local ID, state ID and shadow code.
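The bottom-up procedure and the final assignment can be sketched in Python as follows. This is an illustration under assumptions rather than a reference implementation: ties between equal weights are broken by insertion order, and the choice of which merged element becomes the left child is arbitrary. The resulting codes are valid but may therefore differ from those shown in Figure 4. The `children` map and the function name are hypothetical.

```python
import heapq
from itertools import count

def shadow_encode(children, root):
    """children maps a node to its list of children in the deferment tree.
    Returns (k, ids, codes): shadow length, state IDs, and shadow codes."""
    L, G, W = {}, {}, {}

    def subtree(v):
        yield v
        for c in children.get(v, []):
            yield from subtree(c)

    def process(v):
        kids = children.get(v, [])
        for c in kids:                      # children are processed first (bottom-up)
            process(c)
        L[v], G[v], W[v] = "", "", 0
        if not kids:
            return
        tie = count()                       # keeps heap entries comparable
        heap = [(W[u], next(tie), ("leaf", u)) for u in [v] + kids]
        heapq.heapify(heap)
        while len(heap) > 1:                # Huffman-style merge using max + 1
            wx, _, x = heapq.heappop(heap)
            wy, _, y = heapq.heappop(heap)
            heapq.heappush(heap, (max(wx, wy) + 1, next(tie), ("node", x, y)))
        W[v], _, tree = heap[0]
        hcode = {}
        def walk(t, path):                  # left edge = 0, right edge = 1
            if t[0] == "leaf":
                hcode[t[1]] = path
            else:
                walk(t[1], path + "0")
                walk(t[2], path + "1")
        walk(tree, "")
        L[v] = hcode[v]
        for c in kids:                      # prepend the child's HCode to its subtree
            for d in subtree(c):
                G[d] = hcode[c] + G[d]

    process(root)
    k = W[root]
    ids = {v: G[v] + L[v] + "0" * (k - len(G[v]) - len(L[v])) for v in G}
    codes = {v: G[v] + "*" * (k - len(G[v])) for v in G}
    return k, ids, codes

# Chain s1 -> s2 -> s0 (s0 is the root), and a root r with three leaf children.
k1, ids1, codes1 = shadow_encode({"s0": ["s2"], "s2": ["s1"]}, "s0")
k2, ids2, codes2 = shadow_encode({"r": ["a", "b", "c"]}, "r")
```

For the chain, this sketch yields a shadow length of 2 with codes `**`, `1*`, and `11`. That assignment is the mirror image of the 00, 0*, ** assignment used in Section 3.2.1, and both satisfy the four properties.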

Experimentally, we found that our shadow encoding 
algorithm is effective at minimizing shadow length. No 
DFA had a shadow length larger than ⌈log2 |Q|⌉ + 3, and
⌈log2 |Q|⌉ is the minimum possible shadow length.


4 Table Consolidation 


We now present table consolidation where we combine 
multiple transition tables for different states into a single 
transition table such that the combined table takes less 
TCAM space than the total TCAM space used by the
original tables. To define table consolidation, we need
two new concepts: k-decision rule and k-decision table.
A k-decision rule is a rule whose decision is an array
of k decisions. A k-decision table is a sequence of k-decision
rules following the first-match semantics. Given
a k-decision table T and i (0 ≤ i < k), if for any rule r
in T we delete all the decisions except the i-th decision,
we get a 1-decision table, which we denote as T[i]. In
table consolidation, we take a set of k 1-decision tables
T_0, ..., T_{k−1} and construct a k-decision table T such
that for any i (0 ≤ i < k), the condition T_i ≡ T[i] holds
where T_i ≡ T[i] means that T_i and T[i] are equivalent
(i.e., they have the same decision for every search key).
We call the process of computing k-decision table T table
consolidation, and we call T the consolidated table.


4.1 Observations 


Table consolidation is based on three observations. First,
semantically different TCAM tables may share common
entries with possibly different decisions. For example,
the three tables for s0, s1, and s2 in Fig. 1 have three entries
in common: 01100000, 0110****, and ********.
Table consolidation provides a novel way to remove such
information redundancy. Second, given any set of k 1-decision
tables T_0, ..., T_{k−1}, we can always find a k-decision
table T such that for any i (0 ≤ i < k), the
condition T_i ≡ T[i] holds. This is easy to prove as
we can use one entry for each possible binary search
key in T. Third, a TCAM chip typically has a built-in
SRAM module that is commonly used to store lookup
decisions. For a TCAM with n entries, the SRAM module
is arranged as an array of n entries where SRAM[i]
stores the decision of TCAM[i] for every i. A TCAM
lookup returns the index of the first matching entry in the
TCAM, which is then used as the index to directly find
the corresponding decision in the SRAM. In table consolidation,
we essentially trade SRAM space for TCAM
space because each SRAM entry needs to store multiple
decisions. As SRAM is cheaper and more efficient than
TCAM, moderately increasing SRAM usage to decrease
TCAM usage is worthwhile.

Fig. 5 shows the TCAM lookup table and the SRAM
decision table for a 3-decision consolidated table for
states s0, s1, and s2 in Fig. 1. In this example, by table
consolidation, we reduce the number of TCAM entries
from 11 to 5 for storing the transition tables for states
s0, s1, and s2. This consolidated table has an ID of 0.
As both the table ID and column ID are needed to encode
a state, we use the notation <Table ID>@<Column ID>
to represent a state.


[Table: TCAM columns (Consolidated, Input) and SRAM decision columns]


Figure 5: 3-decision table for 3 states in Fig. 1 
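The interplay between the TCAM index and the SRAM decision array can be sketched in Python. The five patterns and the decision triples below are assumptions: they describe a hypothetical three-state DFA consistent with the running example, with columns 0, 1, and 2 standing for s0, s1, and s2. The table ID field is omitted because only one consolidated table is modeled.

```python
def ternary_match(pattern, key):
    """True if every non-'*' position of the pattern equals the key bit."""
    return all(p in ("*", k) for p, k in zip(pattern, key))

# One TCAM entry per row; SRAM[i] holds the decisions for columns 0, 1, and 2.
TCAM = ["01100000", "01100010", "01100011", "0110****", "********"]
SRAM = [
    ("s0", "s0", "s0"),
    ("s1", "s1", "s1"),   # character b
    ("s1", "s2", "s1"),   # character c
    ("s1", "s2", "s2"),   # remaining characters of the 0110**** block
    ("s0", "s0", "s0"),
]

def lookup(column: int, char: str) -> str:
    """The index of the first matching TCAM entry selects the SRAM row;
    the column ID of the current state selects the decision within that row."""
    key = format(ord(char), "08b")
    for i, pattern in enumerate(TCAM):
        if ternary_match(pattern, key):
            return SRAM[i][column]
    raise KeyError(char)

def project(column: int):
    """T[i]: the 1-decision table obtained by keeping only one decision per rule."""
    return [(pattern, row[column]) for pattern, row in zip(TCAM, SRAM)]
```

All three states share the same five TCAM entries; only the SRAM rows grow wider, which is the SRAM-for-TCAM trade described above.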


There are two key technical challenges in table con- 
solidation. The first challenge is how to consolidate k 
1-decision transition tables into a k-decision transition 
table. The second challenge is which 1-decision transi- 
tion tables should be consolidated together. Intuitively, 
the more similar two 1-decision transition tables are, the 
more TCAM space saving we can get from consolidating 
them together. However, we have to consider the defer- 
ment relationship among states. We present our solutions 
to these two challenges. 


4.2 Computing a k-decision table


In this section, we assume we know which states need to
be consolidated together and present a local state consolidation
algorithm that takes a k1-decision table for state
set S_i and a k2-decision table for another state set S_j as
its input and outputs a consolidated (k1 + k2)-decision
table for state set S_i ∪ S_j. For ease of presentation, we
first assume that k1 = k2 = 1.

Let s1 and s2 be the two input states which have default
transitions to states s3 and s4. We enforce a constraint
that if we do not consolidate s3 and s4 together,
then s1 and s2 cannot defer any transitions at all. If we do
consolidate s3 and s4 together, then s1 and s2 may have
incomplete transition tables due to default transitions to
s3 and s4, respectively. We assign state s1 column ID 0
and state s2 column ID 1. This consolidated table will be
assigned a common table ID X. Thus, we encode s1 as
X@0 and s2 as X@1.

The key concepts underlying this algorithm are breakpoints
and critical ranges. To define breakpoints, it is
helpful to view Σ as numbers ranging from 0 to |Σ| − 1;
given 8-bit characters, |Σ| = 256. For any state s, we
define a character i ∈ Σ to be a breakpoint for s if
δ(s, i) ≠ δ(s, i − 1). For the end cases, we define 0
and |Σ| to be breakpoints for every state s. Let b(s)
be the set of breakpoints for state s. We then define
b(S) = ∪_{s∈S} b(s) to be the set of breakpoints for a
set of states S ⊆ Q. Finally, for any set of states S,
we define r(S) to be the set of ranges defined by b(S):
r(S) = {[0, b_2 − 1], [b_2, b_3 − 1], ..., [b_{|b(S)|−1}, |Σ| − 1]}
where b_i is the ith smallest breakpoint in b(S). Note that
0 = b_1 is the smallest breakpoint and |Σ| is the largest
breakpoint in b(S). Within r(S), we label the range beginning
at breakpoint b_i as r_i for 1 ≤ i ≤ |b(S)| − 1. If
δ(s, b_i) is deferred, then r_i is a deferred range.

When we consolidate s1 and s2 together, we compute
b({s1, s2}) and r({s1, s2}). For each r’ ∈ r({s1, s2})
where r’ is a deferred range for neither s1 nor s2, we
create a consolidated transition rule where the decision
of the entry is the ordered pair of decisions for state s1
and s2 on r’. For each r’ ∈ r({s1, s2}) where r’ is a
deferred range for one of s1 and s2 but not the other, we fill in
r’ in the incomplete transition table where it is deferred,
and we create a consolidated entry where the decision of
the entry is the ordered pair of decisions for state s1 and
s2 on r’. Finally, for each r’ ∈ r({s1, s2}) where r’ is
a deferred range for both s1 and s2, we do not create a
consolidated entry. This produces a non-overlapping set
of transition rules that may be incomplete if some ranges
do not have a consolidated entry. If the final consolidated
transition table is complete, we minimize it using the
optimal 1-dimensional TCAM minimization algorithm
in [20, 31]. If the table is incomplete, we minimize it
using the 1-dimensional incomplete classifier minimization
algorithm in [21]. We generalize this algorithm to
cases where k1 > 1 and k2 > 1 by simply considering
k1 + k2 states when computing breakpoints and ranges.
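Breakpoints, ranges, and the pairing of decisions can be sketched in Python for the simplest case. The sketch assumes two complete transition tables, so there are no deferred ranges, and it stops before the TCAM minimization step. The two example states are hypothetical and differ only on character c.

```python
SIGMA = 256  # |Σ| for 8-bit characters

def breakpoints(delta_s):
    """b(s): characters where the destination changes, plus the end cases 0 and |Σ|."""
    inner = {i for i in range(1, SIGMA) if delta_s[i] != delta_s[i - 1]}
    return inner | {0, SIGMA}

def ranges(deltas):
    """r(S): the ranges [b_i, b_{i+1} - 1] induced by the union of all breakpoints."""
    b = sorted(set().union(*(breakpoints(d) for d in deltas)))
    return [(lo, hi - 1) for lo, hi in zip(b, b[1:])]

def consolidate(delta_1, delta_2):
    """One rule per range; the decision is the ordered pair of destinations.
    Both states are constant on each range, so sampling at lo is sufficient."""
    return [((lo, hi), (delta_1[lo], delta_2[lo]))
            for lo, hi in ranges([delta_1, delta_2])]

# Hypothetical states: outside [a, o] both go to s0.
s2 = ["s0"] * SIGMA
for c in range(ord("a"), ord("o") + 1):
    s2[c] = "s2"
s2[ord("b")] = s2[ord("c")] = "s1"
s1 = list(s2)
s1[ord("c")] = "s2"

rules = consolidate(s1, s2)
```

The union of breakpoints is {0, 97, 98, 99, 100, 112, 256}, giving six non-overlapping rules; only the rule for character c (code 99) carries two different decisions.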


4.3 Choosing States to Consolidate 


We now describe our global consolidation algorithm for 
determining which states to consolidate together. As we 
observed earlier, if we want to consolidate two states 
s1 and s2 together, we need to consolidate their parent
nodes in the deferment forest as well or else lose all the 
benefits of shadow encoding. Thus, we propose to con- 
solidate two deferment trees together. 

A consolidated deferment tree must satisfy the following
properties. First, each node is to be consolidated with
at most one node in the second tree; some nodes may not
be consolidated with any node in the second tree. Second,
a level i node in one tree must be consolidated with
a level i node in the second tree. The level of a node
is its distance from the root. We define the root to be a
level 0 node. Third, if two level i nodes are consolidated
together, their level i − 1 parent nodes must also be consolidated
together. An example legal matching of nodes
between two deferment trees is depicted in Fig. 6.

Figure 6: Consolidating two trees


Given two deferment trees, we start the consolidation
process from the roots. After we consolidate the two
roots, we need to decide how to pair their children together.
For each pair of nodes that are consolidated together,
we again must choose how to pair their children
together, and so on. We make an optimal choice using
a combination of dynamic programming and matching
techniques. Our algorithm proceeds as follows. Suppose
we wish to compute the minimum cost C(x, y), measured
in TCAM entries, of consolidating two subtrees
rooted at nodes x and y where x has u children X =
{x_1, ..., x_u} and y has v children Y = {y_1, ..., y_v}.
We first recursively compute C(x_i, y_j) for 1 ≤ i ≤ u
and 1 ≤ j ≤ v using our local state consolidation algorithm
as a subroutine. We then construct a complete
bipartite graph K(X, Y) such that each edge (x_i, y_j) has
the edge weight C(x_i, y_j) for 1 ≤ i ≤ u and 1 ≤ j ≤ v.
Here C(x, y) is the cost of a minimum weight matching
of K(X, Y) plus the cost of consolidating x and y.
When |X| ≠ |Y|, to make the sets equal in size, we pad
the smaller set with null states that defer all transitions.
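The recursion for C(x, y) can be sketched in Python. Two simplifications are assumptions of this sketch: the minimum weight matching is found by brute force over permutations (a polynomial-time assignment algorithm would be used in practice), and the cost of consolidating two individual states is supplied by a caller-provided function `pair_cost`, standing in for the local state consolidation algorithm. Null states are represented by `None`.

```python
from itertools import permutations

def consolidation_cost(x, y, children, pair_cost):
    """C(x, y): cost of consolidating the subtrees rooted at x and y.
    children maps a node to its list of children; None is a padded null state."""
    xs = list(children.get(x, [])) if x is not None else []
    ys = list(children.get(y, [])) if y is not None else []
    n = max(len(xs), len(ys))
    xs += [None] * (n - len(xs))      # pad the smaller child set with null states
    ys += [None] * (n - len(ys))
    # Brute-force minimum weight perfect matching between the two child sets.
    best = min(
        sum(consolidation_cost(a, b, children, pair_cost) for a, b in zip(xs, perm))
        for perm in permutations(ys)
    )
    return pair_cost(x, y) + best

# Toy trees: root a with children a1, a2; root b with child b1.
children = {"a": ["a1", "a2"], "b": ["b1"]}

def toy_cost(u, v):
    """Hypothetical stand-in: 4 if one side is null, 1 if the name suffixes agree, else 3."""
    if u is None or v is None:
        return 4
    return 1 if u[1:] == v[1:] else 3

total = consolidation_cost("a", "b", children, toy_cost)
```

In the toy example the best pairing matches a1 with b1 and a2 with a null state, for a total of 1 + 1 + 4 = 6.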





Finally, we must decide which trees to consolidate together.
We assume that we produce k-decision tables where k is a
power of 2. We describe how we solve the problem for
k = 2 first. We create an edge-weighted complete graph
where each deferment tree is a node and where
the weight of each edge is the cost of consolidating the
two corresponding deferment trees together. We find a
minimum weight matching of this complete graph to give
us an optimal pairing for k = 2. For larger k = 2^l, we
then repeat this process l − 1 times. Our matching is not
necessarily optimal for k > 2.





Figure 7: DFA for {a.*bc, cde}

In some cases, the deferment forest may have only one
tree. In such cases, we consider consolidating the sub- 
trees rooted at the children of the root of the single defer- 
ment tree. We also consider similar options if we have a 
few deferment trees but they are not structurally similar. 


4.4 Effectiveness of Table Consolidation 


We now explain why table consolidation works well on
real-world RE sets. Most real-world RE sets contain
REs with wildcard closures ‘.*’ where the wildcard ‘.’
matches any character and the closure ‘*’ allows for unlimited
repetitions of the preceding character. Wildcard
closures create deferment trees with lots of structural
similarity. For example, consider the D²FA in Fig. 7
for RE set {a.*bc, cde} where we use dashed arrows
to represent the default transitions. The wildcard
closure ‘.*’ in the RE a.*bc duplicates the entire DFA
sub-structure for recognizing string cde. Thus, table
consolidation of the subtree (0, 1, 2, 3) with the subtree
(4, 5, 6, 7) will lead to significant space saving.


5 Variable Striding 


We explore ways to improve RE matching throughput by 
consuming multiple characters per TCAM lookup. One 
possibility is a k-stride DFA which uses k-stride transi- 
tions that consume k characters per transition. Although 
k-stride DFAs can speed up RE matching by up to a fac- 
tor of k, the number of states and transitions can grow 
exponentially in k. To limit the state and transition space 
explosion, we propose variable striding using variable-stride
DFAs. A k-var-stride DFA consumes between 1
and k characters in each transition with at least one transition
consuming k characters. Conceptually, each state
in a k-var-stride DFA has 256^k transitions, and each transition
is labeled with (1) a unique string of k characters
and (2) a stride length j (1 ≤ j ≤ k) indicating the number
of characters consumed.

In TCAM-based variable striding, each TCAM lookup 
uses the next k consecutive characters as the lookup key,
but the number of characters consumed in the lookup
varies from 1 to k; thus, the lookup decision contains
both the destination state ID and the stride length. 
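The lookup loop just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the entry layout, the per-character '*' wildcard (a real TCAM matches individual ternary bits), the names tcam_lookup and run, and the toy 2-var-stride table for the single RE ab are all hypothetical. The final lookup pads the key with NUL and clamps the stride to the remaining input, which is safe for this toy table but is not a general rule.

```python
from typing import List, Optional, Tuple

Decision = Tuple[int, int]          # (destination state ID, stride length)
Entry = Tuple[int, str, Decision]   # (source state ID, pattern, decision)

# Hypothetical 2-var-stride table for the RE "ab" (state 2 is accepting).
# '*' is a whole-character wildcard, so a literal '*' cannot be matched here.
TOY_TABLE: List[Entry] = [
    (0, "a*", (1, 1)), (0, "*a", (1, 2)), (0, "**", (0, 2)),
    (1, "b*", (2, 1)), (1, "a*", (1, 1)), (1, "**", (0, 1)),
    (2, "a*", (1, 1)), (2, "*a", (1, 2)), (2, "**", (0, 2)),
]

def tcam_lookup(table: List[Entry], state: int, key: str) -> Optional[Decision]:
    """Return the decision of the first (highest-priority) matching entry."""
    for entry_state, pattern, decision in table:
        if entry_state != state:
            continue
        if all(p == "*" or p == c for p, c in zip(pattern, key)):
            return decision
    return None

def run(table: List[Entry], start: int, data: str, k: int) -> Tuple[int, int]:
    """Consume `data` with k-character keys; return (final state, lookups)."""
    state, pos, lookups = start, 0, 0
    while pos < len(data):
        key = data[pos:pos + k].ljust(k, "\0")   # pad a short final key
        decision = tcam_lookup(table, state, key)
        if decision is None:
            raise ValueError("table is incomplete for this input")
        state, stride = decision
        pos += min(stride, len(data) - pos)      # clamp at end of input
        lookups += 1
    return state, lookups

print(run(TOY_TABLE, 0, "xxab", 2))
```

In this toy run the four input characters are consumed in three lookups, because the first lookup skips two self-loop characters at once.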


5.1 Observations 


We use an example to show how variable striding can 
achieve a significant RE matching throughput increase 
with a small and controllable space increase. Fig. 8 
shows a 3-var-stride transition table that corresponds to 
state s0 in Fig. 1. This table only has 7 entries as
opposed to 116 entries in a full 3-stride table for s0. If we
assume that each of the 256 characters is equally likely 
to occur, the average number of characters consumed per
3-var-stride transition of s0 is 1 × 1/16 + 2 × 15/256 +
3 × 225/256 = 2.82.
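The arithmetic can be checked with exact fractions. The sketch below only restates the three weights from the text. The three weights equal 1/16, (15/16)(1/16), and (15/16)^2, which is consistent with each character independently ending the transition with probability 1/16; that reading is an assumption, not something the text states.

```python
from fractions import Fraction

# (stride length, probability) pairs quoted in the text for state s0
stride_distribution = [
    (1, Fraction(1, 16)),
    (2, Fraction(15, 256)),
    (3, Fraction(225, 256)),
]

# The probabilities form a distribution and factor as described above.
assert sum(p for _, p in stride_distribution) == 1
q = Fraction(1, 16)
assert [p for _, p in stride_distribution] == [q, (1 - q) * q, (1 - q) ** 2]

expected_stride = sum(s * p for s, p in stride_distribution)
print(expected_stride, round(float(expected_stride), 2))
```

The exact value is 721/256, which rounds to the 2.82 reported in the text.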


Figure 8: 3-var-stride transition table for s0 (columns: TCAM; SRAM DEC : Stride)


5.2 Eliminating State Explosion 


We first explain how converting a 1-stride DFA to a k- 
stride DFA causes state explosion. For a source state and 
a destination state pair (s,d), a k-stride transition path 
from s to d may contain k − 1 intermediate states (excluding
d); for each unique combination of accepting states
that appear on a k-stride transition path from s to d, we 
need to create a new destination state because a unique 
combination of accepting states implies that the input has 
matched a unique combination of REs. This can be a 
very large number of new states. 

We eliminate state explosion by ending any k-var- 
stride transition path at the first accepting state it reaches. 
Thus, a k-var-stride DFA has the exact same state set 
as its corresponding 1-stride DFA. Ending k-var-stride 
transitions at accepting states does have subtle interac- 
tions with table consolidation and shadow encoding. We 
end any k-var-stride consolidated transition path at the 
first accepting state reached in any one of the paths being 
consolidated which can reduce the expected throughput 
increase of variable striding. There is a similar but even 
more subtle interaction with shadow encoding which we 
describe in the next section. 
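The rule of ending a k-var-stride path at the first accepting state can be illustrated as follows. This is a sketch under assumptions: the nested-dictionary representation of the 1-stride transition function, the function name var_stride_transition, and the toy DFA for the RE ab over the alphabet {a, b, x} are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple

Delta = Dict[int, Dict[str, int]]   # 1-stride transition function

def var_stride_transition(delta: Delta, accepting: Set[int],
                          state: int, chars: str, k: int) -> Tuple[int, int]:
    """Follow at most k 1-stride transitions, stopping at the first
    accepting state reached. Returns (destination state, stride)."""
    stride = 0
    for c in chars[:k]:
        state = delta[state][c]
        stride += 1
        if state in accepting:
            break   # never step past an accepting state
    return state, stride

# Toy 1-stride DFA for "ab"; state 2 is the only accepting state.
TOY_DELTA: Delta = {
    0: {"a": 1, "b": 0, "x": 0},
    1: {"a": 1, "b": 2, "x": 0},
    2: {"a": 1, "b": 0, "x": 0},
}

print(var_stride_transition(TOY_DELTA, {2}, 0, "abx", 3))
print(var_stride_transition(TOY_DELTA, {2}, 0, "xxa", 3))
```

Because every destination is a state of the original DFA, no new states are created, which is the property the text relies on.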


5.3 Controlling Transition Explosion


In a k-stride DFA converted from a 1-stride DFA with
alphabet Σ, a state has |Σ|^k outgoing k-stride transitions.
Although we can leverage our techniques of character 
bundling and shadow encoding to minimize the number 
of required TCAM entries, the rate of growth tends to be 
exponential with respect to stride length k. We have two 
key ideas to control transition explosion: k-var-stride 
transition sharing and self-loop unrolling. 


5.3.1 k-var-stride Transition Sharing Algorithm 


Similar to 1-stride DFAs, there are many transition shar- 
ing opportunities in a k-var-stride DFA. Consider two 
states s0 and s1 in a 1-stride DFA where s0 defers to s1.
The deferment relationship implies that s0 shares many
common 1-stride transitions with s1. In the k-var-stride
DFA constructed from the 1-stride DFA, all k-var-stride
transitions that begin with these common 1-stride
transitions are also shared between s0 and s1. Furthermore,
two transitions that do not begin with these common
1-stride transitions may still be shared between s0 and s1.
For example, in the 1-stride DFA fragment in Fig. 9,
although s1 and s2 do not share a common transition for
character a, when we construct the 2-var-stride DFA, s1
and s2 share the same 2-stride transition on string aa that
ends at state s5.


To promote transition sharing among states in a
k-var-stride DFA, we first need to decide on the
deferment relationship among states. The ideal
deferment relationship should be calculated based on the
SRG of the final k-var-stride DFA. However, the
k-var-stride DFA cannot be finalized before we need to
compute the deferment relationship among states because
the final k-var-stride DFA is subject to many factors such
as available TCAM space. There are two approximation
options for the final k-var-stride DFA for calculating
the deferment relationship: the 1-stride DFA and the
full k-stride DFA. We have tried both options in our
experiments, and the difference in the resulting TCAM
space is negligible. Thus, we simply use the deferment
forest of the 1-stride DFA in computing the transition
tables for the k-var-stride DFA.

Figure 9: s1 and s2 share transition aa


Second, for any two states s1 and s2 where s1 defers to
s2, we need to compute s1’s k-var-stride transitions that
are not shared with s2 because those transitions will
constitute s1’s k-var-stride transition table. Although this
computation is trivial for 1-stride DFAs, this is a
significant challenge for k-var-stride DFAs because each
state has too many (256^k) k-var-stride transitions. The
straightforward algorithm that enumerates all transitions
has a time complexity of O(|Q|^2 |Σ|^k), which grows
exponentially with k. We propose a dynamic programming
algorithm with a time complexity of O(|Q|^2 |Σ| k),
which grows linearly with k. Our key idea is that the
non-shared transitions for a k-stride DFA can be quickly
computed from the non-shared transitions of a
(k-1)-var-stride DFA. For example, consider the two states s1 and
s2 in Fig. 9 where s1 defers to s2. For character a, s1
transits to s3 while s2 transits to s4. Assuming that we
have computed all (k-1)-var-stride transitions of s3 that
are not shared with the (k-1)-var-stride transitions of s4,
if we prepend all these (k-1)-var-stride transitions with
character a, the resulting k-var-stride transitions of s1 are
all not shared with the k-var-stride transitions of s2, and
therefore should all be included in s1’s k-var-stride
transition table. Formally, using n(s_i, s_j, k) to denote the
number of k-stride transitions of s_i that are not shared
with s_j, our dynamic programming algorithm uses the
following recursive relationship between n(s_i, s_j, k) and
n(s_i, s_j, k − 1):


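\[ n(s_i, s_j, 0) = \begin{cases} 0 & \text{if } s_i = s_j \\ 1 & \text{if } s_i \neq s_j \end{cases} \tag{1} \]

\[ n(s_i, s_j, k) = \sum_{c \in \Sigma} n(\delta(s_i, c), \delta(s_j, c), k - 1) \tag{2} \]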


The above formulae assume that the intermediate
states on the k-stride paths starting from s_i or s_j are all
non-accepting. For state s_i, we stop increasing the stride
length along a path whenever we encounter an accepting
state on that path or on the corresponding path starting
from s_j. The reason is similar to why we stop a
consolidated path at an accepting state, but the reasoning is
more subtle.

Let p be the string that leads s_j to an accepting state.
The key observation is that we know that any k-var-stride
path that starts from s_j and begins with p ends at that
accepting state. This means that s_i cannot exploit transition
sharing on any strings that begin with p.

The above dynamic programming algorithm produces
non-overlapping and incomplete transition tables
that we compress using the 1-dimensional incomplete
classifier minimization algorithm in [21].
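The counting recurrence can be sketched with memoization, which is what brings the cost down from enumerating every k-character string to visiting each (state pair, stride) combination once. This is an illustrative sketch, not the authors' code: the base case assumes that a zero-length transition is shared exactly when the two states coincide, accepting-state truncation is ignored, and the five-state fragment is a hypothetical DFA loosely modeled on the situation described for Fig. 9.

```python
from functools import lru_cache
from typing import Callable, Dict, Sequence

Delta = Dict[int, Dict[str, int]]   # 1-stride transition function

def make_counter(delta: Delta, alphabet: Sequence[str]) -> Callable[[int, int, int], int]:
    """Return n(si, sj, k): the number of k-stride transitions of si that
    are not shared with sj (all intermediate states assumed non-accepting)."""
    @lru_cache(maxsize=None)
    def n(si: int, sj: int, k: int) -> int:
        if k == 0:
            return 0 if si == sj else 1
        return sum(n(delta[si][c], delta[sj][c], k - 1) for c in alphabet)
    return n

# Hypothetical fragment: states 1 and 2 disagree on 'a' (3 versus 4),
# but both reach state 5 on the string "aa".
FRAGMENT: Delta = {
    1: {"a": 3, "b": 5},
    2: {"a": 4, "b": 5},
    3: {"a": 5, "b": 3},
    4: {"a": 5, "b": 4},
    5: {"a": 5, "b": 5},
}

n = make_counter(FRAGMENT, "ab")
print(n(1, 2, 1), n(1, 2, 2))
```

In this fragment only one of the four 2-stride strings ("ab") leads the two states to different destinations, so three of the four 2-stride transitions can be shared even though the states differ on the single character a.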


5.3.2 Self-Loop Unrolling Algorithm 


We now consider root states, most of which are self- 
looping. We have two methods to compute the k-var- 
stride transition tables of root states. The first is direct
expansion (stopping transitions at accepting states); since
these states do not defer to other states, this results in
an exponential increase in table size with respect to k.
The second method, which we call self-loop unrolling, 
scales linearly with k. 

Self-loop unrolling increases the stride of all the self- 
loop transitions encoded by the last default TCAM entry. 
Self-loop unrolling starts with a root state j-var-stride
transition table encoded as a compressed TCAM table of
n entries with a final default entry representing most of
the self-loops of the root state. Note that given any com- 
plete TCAM table where the last entry is not a default 
entry, we can always replace that last entry with a default 
entry without changing the semantics of the table. We 
generate the (j+1)-var-stride transition table by expand- 
ing the last default entry into n new entries, which are 
obtained by prepending 8 *s as an extra default field to 
the beginning of the original n entries. This produces 
a (j+1)-var-stride transition table with 2n — 1 entries. 
Fig. 8 shows the resulting table when we apply self-loop
unrolling twice on the DFA in Fig. 1. 
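One possible reading of self-loop unrolling is sketched below. Several choices here are assumptions rather than details from the text: entries are (pattern, destination, stride) triples, each pattern field is a whole character with '*' standing for 8 wildcard bits, and at every step the default entry is expanded using the root state's 1-stride entries prefixed with wildcard fields. That last choice keeps growth linear in the stride, and for the first unrolling step it yields the 2n − 1 entries stated above. The names unroll_once and unroll are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[Tuple[str, ...], int, int]   # (pattern fields, destination, stride)
WILDCARD = ("*",)

def unroll_once(table_j: List[Entry], base: List[Entry], j: int) -> List[Entry]:
    """Build a (j+1)-var-stride table from a j-var-stride table whose last
    entry is the default self-loop entry. `base` is the 1-stride table."""
    # Keep every non-default entry, widened by one trailing wildcard field.
    kept = [(p + WILDCARD, d, s) for p, d, s in table_j[:-1]]
    # Expand the default entry: j self-loop characters, then one base entry.
    expanded = [(WILDCARD * j + p, d, s + j) for p, d, s in base]
    return kept + expanded

def unroll(base: List[Entry], k: int) -> List[Entry]:
    """Apply unrolling k - 1 times to a 1-stride root state table."""
    table = list(base)
    for j in range(1, k):
        table = unroll_once(table, base, j)
    return table

# Hypothetical root state: 'a' -> state 1, 'b' -> state 2, otherwise self-loop.
BASE: List[Entry] = [(("a",), 1, 1), (("b",), 2, 1), (("*",), 0, 1)]

for pattern, dest, stride in unroll(BASE, 3):
    print("".join(pattern), dest, stride)
```

With this three-entry 1-stride table, one unrolling step gives five entries and two steps give seven, so each additional stride adds a constant number of entries.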


5.4 Variable Striding Selection Algorithm 


We now propose solutions for the third key challenge:
which states should have their stride lengths increased
and by how much, i.e., how should we compute the
transition function δ. Note that each state can independently
choose its variable striding length as long as the final
transition tables are composed together according to the
deferment forest. This can be easily proven based on
the way that we generate k-var-stride transition tables.
For any two states s1 and s2 where s1 defers to s2, the
way that we generate s1’s k-var-stride transition table
is seemingly based on the assumption that s2’s transition
table is also k-var-stride; actually, we do not have
this assumption. For example, if we choose k-var-stride
(2 ≤ k) for s1 and 1-stride for s2, all strings from s1
will be processed correctly; the only issue is that strings
deferred to s2 will process only one character.


We view this as a packing problem: given a TCAM
capacity C, for each state s, we select a variable stride
length value K_s, such that Σ_{s∈Q} |T(s, K_s)| ≤ C, where
T(s, K_s) denotes the K_s-var-stride transition table of
state s. This packing problem has a flavor of the knapsack
problem, but an exact formulation of an optimization
function is impossible without making assumptions
about the input character distribution. We propose the
following algorithm for finding a feasible δ that strives
to maximize the minimum stride of any state. First, we
use all the 1-stride tables as our initial selection. Second,
for each j-var-stride (j ≥ 2) table t of state s, we create
a tuple (l, d, |t|) where l denotes variable stride length, d
denotes the distance from state s to the root of the
deferment tree that s belongs to, and |t| denotes the number
of entries in t. As stride length l increases, the individual
table size |t| may increase significantly, particularly for
the complete tables of root states. To balance table sizes,
we set limits on the maximum allowed table size for root
states and non-root states. If a root state table exceeds the
root state threshold when we create its j-var-stride table,
we apply self-loop unrolling once to its (j − 1)-var-stride
table to produce a j-var-stride table. If a non-root state
table exceeds the non-root state threshold when we create
its j-var-stride table, we simply use its (j − 1)-var-stride
table as its j-var-stride table. Third, we sort the tables by
these tuple values in increasing order first using l, then
using d, then using |t|, and finally a pseudorandom coin
flip to break ties. Fourth, we consider each table t in order.
Let t′ be the table for the same state s in the current
selection. If replacing t′ by t does not exceed our TCAM
capacity C, we do the replacement.
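The greedy selection steps can be sketched as follows. This is a simplified illustration under assumptions: table sizes are supplied as plain integers rather than computed, the root and non-root size thresholds are assumed to have been applied already when the candidate tables were built, and the function name select_strides, the seed parameter, and the example numbers are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Candidate = Tuple[int, int, int, str]   # (stride l, depth d, table size |t|, state)

def select_strides(one_stride_sizes: Dict[str, int],
                   candidates: List[Candidate],
                   capacity: int,
                   seed: int = 0) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Greedily upgrade per-state tables without exceeding the TCAM capacity.
    Returns ({state: (stride, size)}, total entries used)."""
    rng = random.Random(seed)
    # Step 1: start from the 1-stride table of every state.
    selection = {s: (1, size) for s, size in one_stride_sizes.items()}
    used = sum(size for _, size in selection.values())
    if used > capacity:
        raise ValueError("1-stride tables alone exceed the TCAM capacity")
    # Step 3: order by stride, then depth, then size, then a random tie-break.
    ordered = sorted(candidates, key=lambda c: (c[0], c[1], c[2], rng.random()))
    # Step 4: replace the current table of a state whenever the result fits.
    for stride, _depth, size, state in ordered:
        _, current_size = selection[state]
        if used - current_size + size <= capacity:
            selection[state] = (stride, size)
            used += size - current_size
    return selection, used

selection, used = select_strides(
    {"root": 3, "s1": 2},
    [(2, 0, 5, "root"), (2, 1, 4, "s1"), (3, 0, 7, "root"), (3, 1, 9, "s1")],
    capacity=12,
)
print(selection, used)
```

Sorting by stride first means every state is offered stride 2 before any state is offered stride 3, which is how the procedure favors raising the minimum stride.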
6 Implementation and Modeling


Entries    TCAM chip size   TCAM chip size   Latency
           (36-bit wide)    (72-bit wide)    (ns)
1024       0.037 Mb         0.074 Mb
2048       0.074 Mb         0.147 Mb
4096       0.147 Mb         0.295 Mb
8192       0.295 Mb         0.590 Mb
16384      0.590 Mb         1.18 Mb
32768      1.18 Mb          2.36 Mb
65536      2.36 Mb          4.72 Mb
131072     4.72 Mb          9.44 Mb

Table 1: TCAM size in Mb and Latency in ns


We now describe some implementation issues associated
with our TCAM-based RE matching solution. First,
the only hardware required to deploy our solution is the 
off-the-shelf TCAM (and its associated SRAM). Many 
deployed networking devices already have TCAMs, but 
these TCAMs are likely being used for other purposes. 
Thus, to deploy our solution on existing network devices, 
we would need to share an existing TCAM with another 
application. Alternatively, new networking devices can 
be designed with an additional dedicated TCAM chip. 

Second, we describe how we update the TCAM when 
an RE set changes. First, we must compute a new DFA 
and its corresponding TCAM representation. For the 
moment, we recompute the TCAM representation from 
scratch, but we believe a better solution can be found and 
is something we plan to work on in the future. We report 
some timing results in our experimental section. Fortu- 
nately, this is an offline process during which time the 
DFA for the original RE set can still be used. The sec- 
ond step is loading the new TCAM entries into TCAM. If 
we have a second TCAM to support updates, this rewrite 
can occur while the first TCAM chip is still processing
packet flows. If not, RE matching must halt while the 
new entries are loaded. This step can be performed very 
quickly, so the delay will be very short. In contrast, up- 
dating FPGA circuitry takes significantly longer. 

We have not developed a full implementation of our 
system. Instead, we have only developed the algorithms 
that would take an RE set and construct the associated 
TCAM entries. Thus, we can only estimate the through- 
put of our system using TCAM models. We use Agrawal 
and Sherwood’s TCAM model [1] assuming that each 
TCAM chip is manufactured with a 0.18 µm process to
compute the estimated latency of a single TCAM lookup 
based on the number of TCAM entries searched. These 
model latencies are shown in Table 1. We recognize that 
some processing must be done besides the TCAM lookup 
such as composing the next state ID with the next input 
character; however, because the TCAM lookup latency is 
much larger than any other operation, we focus only on
this parameter when evaluating the potential throughput 
of our system. 


7 Experimental Results 


In this section, we evaluate our TCAM-based RE match- 
ing solution on real-world RE sets focusing on two met- 
rics: TCAM space and RE matching throughput. 


7.1 Methodology 


We obtained 4 proprietary RE sets, namely C7, C8, C10, 
and C613, from a large networking vendor, and 4 public 
RE sets, namely Snort24, Snort31, Snort34, and Bro217 
from the authors of [6] (we do report a slightly differ- 
ent number of states for Snort31, 20068 to 20052; this 
may be due to Becchi et al. making slight changes to 
their Regular Expression Processor that we used). Quoting
Becchi et al. [6], “Snort rules have been filtered according
to the headers ($HOME_NET, any, $EXTERNAL_NET,
$HTTP_PORTS/any) and ($HOME_NET,
any, 25, $HTTP_PORTS/any). In the experiments which
follow, rules have been grouped so to obtain DFAs with 
reasonable size and, in parallel, have datasets with dif- 
ferent characteristics in terms of number of wildcards, 
frequency of character ranges and so on.” Of these 8 RE 
sets, the REs in C613 and Bro217 are all string matching
REs, the REs in C7, C8, and C10 all contain wildcard
closures ‘.*’, and about 40% of the REs in Snort24,
Snort31, and Snort34 contain wildcard closures ‘.*’.

Finally, to test the scalability of our algorithms, we 
use one family of 34 REs from a recent public release 
of the Snort rules with headers ($EXTERNAL_NET,
$HTTP_PORTS, $HOME_NET, any), most of which 
contain wildcard closures ‘.*’. We added REs one at a 
time until the number of DFA states reached 305,339. 
We name this family Scale. 

We calculate TCAM space by multiplying the number 
of entries by the TCAM width: 36, 72, 144, 288, or 576 
bits. For a given DFA, we compute a minimum width by 
summing the number of state ID bits required with the 
number of input bits required. In all cases, we needed at 
most 16 state ID bits. For 1-stride DFAs, we need exactly 
8 input character bits, and for 7-var-stride DFAs, we need 
exactly 56 input character bits. We then calculate the 
TCAM width by rounding the minimum width up to the 
smallest larger legal TCAM width. For all our 1-stride 
DFAs, we use TCAM width 36. For all our 7-var-stride 
DFAs, we use TCAM width 72. 
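The width and space calculation can be written out as a short sketch. The function names tcam_width and tcam_space_megabits are hypothetical, and the conversion assumes 1 Mb = 10^6 bits, which reproduces the chip sizes listed in Table 1 (for example, 32,768 entries at 72 bits is 2.36 Mb).

```python
LEGAL_WIDTHS = (36, 72, 144, 288, 576)   # legal TCAM widths in bits

def tcam_width(state_id_bits: int, stride: int) -> int:
    """Smallest legal TCAM width holding the state ID plus 8 bits per character."""
    minimum = state_id_bits + 8 * stride
    for width in LEGAL_WIDTHS:
        if width >= minimum:
            return width
    raise ValueError(f"{minimum} bits exceeds the widest legal TCAM entry")

def tcam_space_megabits(entries: int, width: int) -> float:
    """TCAM space as entries times width, in megabits (10^6 bits)."""
    return entries * width / 1e6

print(tcam_width(16, 1), tcam_width(16, 7))
print(round(tcam_space_megabits(32768, 72), 2))
```

With 16 state ID bits, a 1-stride DFA needs 24 bits and rounds up to width 36, while a 7-var-stride DFA needs exactly 72 bits.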

We estimate the potential throughput of our TCAM- 
based RE matching solution by using the model TCAM 
lookup speeds we computed in Section 6 to determine 
how many TCAM lookups can be performed in a second 
RE set     # states   TS, TS+TC2, TS+TC4 (each: TCAM megabits, #Entries per state, throughput Gbps)
Bro217     6533
C613       11308
C10        14868
C7         24750
C8         3108
Snort24    13886
Snort31    20068
Snort34    13825

Table 2: TCAM size and throughput for 1-stride DFAs


for a given number of TCAM entries and then multiply- 
ing this number by the number of characters processed 
per TCAM lookup. With 1-stride TCAMs, the number 
of characters processed per lookup is 1. For 7-var-stride 
DFAs, we measure the average number of characters pro- 
cessed per lookup in a variety of input streams. We use 
Becchi et al.’s network traffic generator [9] to generate 
a variety of synthetic input streams. This traffic gener- 
ator includes a parameter that models the probability of 
malicious traffic p_M. With probability p_M, the next character
is chosen so that it leads away from the start state.
With probability (1 − p_M), the next character is chosen
uniformly at random. 


7.2 Results on 1-stride DFAs 


Table 2 shows our experimental results on the 8 RE sets 
using 1-stride DFAs. We use TS to denote our transition 
sharing algorithm including both character bundling and 
shadow encoding. We use TC2 and TC4 to denote our 
table consolidation algorithm where we consolidate at 
most 2 and 4 transition tables together, respectively. For 
each RE set, we measure the number of states in its 1-stride
DFA, the resulting TCAM space in megabits, the average 
number of TCAM table entries per state, and the pro- 
jected RE matching throughput; the number of TCAM 
entries is the number of states times the average number 
of entries per state. The TS column shows our results 
when we apply TS alone to each RE set. The TS+TC2 
and TS+TC4 columns show our results when we apply 
both TS and TC under the consolidation limit of 2 and 4, 
respectively, to each RE set. 

We draw the following conclusions from Table 2. (1) 
Our RE matching solution is extremely effective in saving 
TCAM space. Using TS+TC4, the maximum TCAM size 
for the 8 RE sets is only 0.50 Mb, which is two orders of 
magnitude smaller than the current largest commercially 
available TCAM chip size of 72 Mb. More specifically, 
the number of TCAM entries per DFA state ranges between
0.32 and 1.17 when we use TC4. We require 16,
32, or 64 SRAM bits per TCAM entry for TS, TS+TC2,
and TS+TC4, respectively, as we need to record 1, 2, or
4 16-bit state IDs in each decision, respectively.
(2) Transition sharing alone is very effective. With the
transition sharing algorithm alone, the maximum TCAM 
size is only 1.43Mb for the 8 RE sets. Furthermore, we 
see a relatively tight range of TCAM entries per state of 
1.16 to 2.07. Transition sharing works extremely well 
with all 8 RE sets including those with wildcard clo- 
sures and those with primarily strings. (3) Table con- 
solidation is very effective. On the 8 RE sets, adding 
TC2 to TS improves compression by an average of 41% 
(ranging from 16% to 49%) where the maximum possible
is 50%. We measure improvement by computing
(TS − (TS+TC2))/TS. Replacing TC2 with TC4
improves compression by an average of 36% (ranging
from 13% to 47%) where we measure improvement by
computing ((TS+TC2) − (TS+TC4))/(TS+TC2).
Here we do observe a difference in performance, though. 
For the two RE sets Bro217 and C613 that are primarily 
strings without table consolidation, the average improve- 
ments of using TC2 and TC4 are only 24% and 15%, 
respectively. For the remaining six RE sets that have 
many wildcard closures, the average improvements are 
47% and 43%, respectively. The reason, as we touched 
on in Section 4.4, is how wildcard closure creates multiple
deferment trees with almost identical structure. Thus
wildcard closures, the prime source of state explosion, are
particularly amenable to compression by table consolidation.
In such cases, doubling our table consolidation
limit does not greatly increase SRAM cost. Specifically,
while the number of SRAM bits per TCAM entry doubles
as we double the consolidation limit, the number
of TCAM entries required almost halves! (4) Our RE
matching solution achieves high throughput with even
1-stride DFAs. For the TS+TC4 algorithm, on the 8 RE
sets, the average throughput is 4.60 Gbps (ranging from
3.64 Gbps to 8.51 Gbps).


We use our Scale dataset to assess the scalability of 
our algorithms’ performance focusing on the number of 
TCAM entries per DFA state. Fig. 10(a) shows the num- 
ber of TCAM entries per state for TS, TS+TC2, and 
TS+TC4 for the Scale REs containing 26 REs (with DFA 
size 1275) to 34 REs (with DFA size 305,339). The DFA 
size roughly doubled for every RE added. In general, the 
Figure 10: TCAM entries per DFA state (a) and compute
time per DFA state (b) for Scale 26 through Scale 34. 


number of TCAM entries per state is roughly constant 
and actually decreases with table consolidation. This is 
because table consolidation performs better as more REs 
with wildcard closures are added as there are more trees 
with similar structure in the deferment forest. 


We now analyze running time. We ran our exper- 
iments on the Michigan State University High Perfor- 
mance Computing Center (HPCC). The HPCC has sev- 
eral clusters; most of our experiments were executed 
on the fastest cluster which has nodes that each have 2 
quad-core Xeons running at 2.3GHz. The total RAM for 
each node is 8GB. Fig. 10(b) shows the compute time 
per state in milliseconds. The build times are the time 
per DFA state required to build the non-overlapping set 
of transitions (applying TS and TC); these increase lin- 
early because these algorithms are quadratic in the num- 
ber of DFA states. For our largest DFA Scale 34 with 
305,339 states, the total time required for TS, TS+TC2, 
and TS+TC4 is 19.25 mins, 118.6 hrs, and 150.2 hrs, 
respectively. These times are cumulative; that is going 
from TS+TC2 to TS+TC4 requires an additional 31.6 
hours. This table consolidation time is roughly one 
fourth of the first table consolidation time because the 
number of DFA states has been cut in half by the first ta- 
ble consolidation and table consolidation has a quadratic 
running time in the number of DFA states. The BW times 
are the time per DFA state required to minimize these 
transition tables using the Bitweaving algorithm in [21]; 
these times are roughly constant as Bitweaving depends 
on the size of the transition tables for each state and is not 
dependent on the size of the DFA. For our largest DFA 
Scale 34 with 305,339 states, the total Bitweaving opti- 
mization time on TS, TS+TC2, and TS+TC4 is 10 hrs, 5 
hrs, and 2.5 hrs. These times are not cumulative and fall 
by a factor of 2 as each table consolidation step cuts the 
number of DFA states by a factor of 2. 
7.3 Results on 7-var-stride DFAs


We consider two implementations of variable striding 
assuming we have a 2.36 megabit TCAM with TCAM 
width 72 bits (32,768 entries). Using Table 1, the latency 
of a lookup is 2.57 ns. Thus, the potential RE matching 
throughput of a 7-var-stride DFA with average stride
S is 8 × S / (2.57 × 10^-9) bps = 3.11 × S Gbps.
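The throughput model is a one-line calculation: 8 bits per character times the average stride, divided by the lookup latency. The helper below is a sketch with a hypothetical name, throughput_gbps; it takes the latency in nanoseconds, so bits per nanosecond is already Gbps.

```python
def throughput_gbps(avg_stride: float, latency_ns: float) -> float:
    """Estimated RE matching throughput in Gbps.

    Each lookup consumes avg_stride characters of 8 bits each and takes
    latency_ns nanoseconds; bits per nanosecond equals gigabits per second.
    """
    return 8 * avg_stride / latency_ns

# 2.57 ns is the lookup latency quoted in the text for the 2.36 Mb TCAM.
print(round(throughput_gbps(1.0, 2.57), 2))
print(round(throughput_gbps(5.97, 2.57), 2))
```

An average stride of 1 gives the 3.11 Gbps factor used in the text, and an average stride of 5.97 gives about 18.58 Gbps.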

In our first implementation, we only use self-loop un- 
rolling of root states in the deferment forest. Specifically, 
for each RE set, we first construct the 1-stride DFA using 
transition sharing. We then apply self-loop unrolling to 
each root state of the deferment forest to create a
7-var-stride transition table. In all cases, the increase in size
due to self-loop unrolling is tiny. The bigger issue was 
that the TCAM width doubled from 36 bits to 72 bits. 
We can decrease the TCAM space by using table con- 
solidation; this was very effective for all RE sets except 
the string matching RE sets Bro217 and C613. This was 
only necessary for Snort31. All other self-loop unrolled 
tables fit within our available TCAM space. 

Second, we apply full variable striding. Specifically, 
we first create 1-stride DFAs using transition sharing and 
then apply variable striding with no table consolidation, 
table consolidation with 2-decision tables, and table con- 
solidation with 4-decision tables. We use the best result 
that fits within the 2.36 megabit TCAM space. For the 
RE sets Bro217, C8, C613, Snort24 and Snort34, no ta- 
ble consolidation is used. For C10 and Snort31, we use 
table consolidation with 2-decision tables. For C7, we 
use table consolidation with 4-decision tables. 

We now run both implementations of our 7-var-stride 
DFAs on traces of length 287484 to compute the aver- 
age stride. For each RE set, we generate 4 traces using 
Becchi et al.’s trace generator tool using default values 
35%, 55%, 75%, and 95% for the parameter p_M. These
generate increasingly malicious traffic that is more likely 
to move away from the start state towards distant accept 
states of that DFA. We also generate a completely ran- 
dom string to model completely uniform traffic such as 
binary traffic patterns which we treat as p_M = 0.

We group the 8 RE sets into 3 groups: group (a) repre- 
sents the two string matching RE sets Bro217 and C613; 
group (b) represents the three RE sets C7, C8, and C10 
that contain all wildcard closures; group (c) represents 
the three RE sets Snort24, Snort31, and Snort34 that con- 
tain roughly 40% wildcard closures. Fig. 11 shows the 
average stride length and throughput for the three groups 
of RE sets according to the parameter p_M (the random
string trace is p_M = 0).

We make the following observations. (1) Self-loop unrolling
is extremely effective on the uniform trace. For
the non string matching sets, it achieves an average stride 
length of 5.97 and 5.84 and RE matching throughputs 
Figure 11: The throughput and average stride length of
RE sets. 


of 18.58 and 18.15 Gbps for groups (b) and (c), re- 
spectively. For the string matching sets in group (a), it 
achieves an average stride length of 3.30 and a result- 
ing throughput of 10.29 Gbps. Even though only the 
root states are unrolled, self-loop unrolling works very 
well because the non-root states that defer most transi- 
tions to a root state will still benefit from that root state’s 
unrolled self-loops. In particular, it is likely that there
will be long stretches of the input stream that repeatedly
return to a root state and take full advantage of the
unrolled self-loops. (2) The performance of self-loop
unrolling does degrade steadily as p_M increases for all RE
sets except those in group (b). This occurs because as
p_M increases, we are more likely to move away from
any default root state. Thus, fewer transitions will be
able to leverage the unrolled self-loops at root states. (3)
For the uniform trace, full variable striding does little
to increase RE matching throughput. Of course, for the
non-string matching RE sets, there was little room for
improvement. (4) As p_M increases, full variable striding
does significantly increase throughput, particularly
for groups (b) and (c). For example, for groups (b) and
(c), the minimum average stride length is 2.91 for all
values of p_M, which leads to a minimum throughput of
9.06 Gbps. Also, for all groups of RE sets, the average
stride length for full variable striding is much higher
than that for self-loop unrolling for large p_M. For
example, when p_M = 95%, full variable striding achieves
average stride lengths of 2.55, 2.97, and 3.07 for groups
(a), (b), and (c), respectively, whereas self-loop unrolling
achieves average stride lengths of only 1.04, 1.83, and
1.06 for groups (a), (b), and (c), respectively.

These results indicate the following. First, self-loop 
unrolling is extremely effective at increasing throughput 
for random traffic traces. Second, other variable striding
techniques can mitigate many of the effects of malicious 
traffic that lead away from the start state. 


8 Conclusions


We make four key contributions in this paper. (1) We 
propose the first TCAM-based RE matching solution. 
We prove that this unexplored direction not only works 
but also works well. (2) We propose two fundamental 
techniques, transition sharing and table consolidation, to 
minimize TCAM space. (3) We propose variable striding 
to speed up RE matching while carefully controlling the 
corresponding increase in memory. (4) We implemented 
our techniques and conducted experiments on real-world 
RE sets. We show that small TCAMs are capable of stor- 
ing large DFAs. For example, in our experiments, we 
were able to store a DFA with 25K states in a 0.5 Mb
TCAM chip; most DFAs require at most 1 TCAM entry 
per DFA state. With variable striding, we show that a 
throughput of up to 18.6 Gbps is possible. 
Abstract


Search engines not only assist normal users, but also pro- 
vide information that hackers and other malicious enti- 
ties can exploit in their nefarious activities. With care- 
fully crafted search queries, attackers can gather infor- 
mation such as email addresses and misconfigured or 
even vulnerable servers. 

We present SearchAudit, a framework that identifies 
malicious queries from massive search engine logs in or- 
der to uncover their relationship with potential attacks. 
SearchAudit takes in a small set of malicious queries as 
seed, expands the set using search logs, and generates 
regular expressions for detecting new malicious queries. 
For instance, we show that, relying on just 500 malicious 
queries as seed, SearchAudit discovers an additional 4 
million distinct malicious queries and thousands of vul- 
nerable Web sites. In addition, SearchAudit reveals a 
series of phishing attacks from more than 400 phishing 
domains that compromised a large number of Windows 
Live Messenger user credentials. Thus, we believe that 
SearchAudit can serve as a useful tool for identifying and 
preventing a wide class of attacks in their early phases. 


1 Introduction 


With the amount of information in the Web rapidly grow- 
ing, the search engine has become an everyday tool for 
people to find relevant and useful information. While 
search engines make online browsing easier for normal 
users, they have also been exploited by malicious entities 
to facilitate their various attacks. For example, in 2004, 
the MyDoom worm used Google to search for email ad- 
dresses in order to send spam and virus emails. Recently, 
it was also reported that hackers used search engines to 
identify vulnerable Web sites and compromised them im- 
mediately after the malicious searches [20, 16]. These 
compromised Web sites were then used to serve malware 
or phishing pages. 
Indeed, by crafting specific search queries, hackers 
may get very specific information from search engines 
that could potentially reveal the existence and locations 
of security flaws such as misconfigured servers and vul- 
nerable software. Furthermore, attackers may prefer us- 
ing search engines because it is stealthier and easier than 
setting up their own crawlers. 

The identification of these malicious queries thus pro- 
vides a wide range of opportunities to disrupt or prevent 
potential attacks at their early stages. For example, a 
search engine may choose not to return results to these 
malicious queries [20], making it harder for attackers to 
obtain useful information. In addition, these malicious 
queries could provide rich information about the attack- 
ers, including their intentions and locations. Therefore, 
strategically, we can let the attackers guide us to better 
understand their methods and techniques, and ultimately, 
to predict and prevent followup attacks before they are 
launched. 

In this paper, we present SearchAudit, a suspicious- 
query generation framework that identifies malicious 
queries by auditing search engine logs. While auditing is 
often an important component of system security, the au- 
diting of search logs is particularly worthwhile, both be- 
cause authentication and authorization (two other pillars 
of security [14]) are relatively weak in search engines, 
and because of the wealth of information that search en- 
gines and their logs contain. 

Working with SearchAudit consists of two stages: 
identification and investigation. In the first stage, 
SearchAudit identifies malicious queries. In the second 
stage, with SearchAudit’s assistance, we focus on ana- 
lyzing those queries and the attacks of which they are 
part. 

More specifically, in the first stage, SearchAudit takes 
a few known malicious queries as seed input and tries 
to identify more malicious queries. The seed can be ob- 
tained from hacker Web sites [1], known security vul- 
nerabilities, or case studies performed by other security 


19th USENIX Security Symposium —= 127 



researchers [16]. As seed malicious queries are usu- 
ally limited in quantity and restricted by previous dis- 
coveries, SearchAudit monitors the hosts that conducted 
these malicious queries to obtain an expanded set of 
queries from these hosts. Using the expanded set of 
queries, SearchAudit further generates regular expres- 
sions, which are then used to match search logs for iden- 
tifying other malicious queries. This step is critical as 
malicious queries are typically automated searches gen- 
erated by scripts. Using regular expressions offers us the 
opportunity to catch a large number of other queries with 
a similar format, possibly generated by such scripts. 

After identifying a large number of malicious queries, 
in stage two, we analyze the malicious queries and the 
correlation between search and other attacks. In particu- 
lar, we ask questions such as: why do attackers use Web 
search, how do they leverage search results, and who are 
the victims. Answers to these questions not only help 
us better understand the attacks, but also provide us an 
opportunity to protect or notify potential victims before 
the actual attacks are launched, and hence stop attacks in 
their early stages. 

We apply SearchAudit to three months of sampled 
Bing search logs. As search logs contain massive 
amounts of data, SearchAudit is implemented on the 
Dryad/DryadLINQ [11, 26] platform for large-scale data 
analysis. It is able to process over 1.2TB of data in 7 
hours using 240 machines. 

To our knowledge, we are the first to present a system- 
atic approach for uncovering the correlations between 
malicious searches and the attacks enabled by them. Our 
main results include: 


• Enhanced detection capability: Using just 500 seed 
queries obtained from one hacker Web site, SearchAudit 
detects another 4 million malicious queries, some 
even before they are listed by hacker Web sites. 

• Low false-positive rates: Over 99% of the captured 
malicious queries display multiple bot features, while 
less than 2% of normal user queries do. 

• Ability to detect new attacks: While the seed queries 
are mostly ones used to search for Web site 
vulnerabilities, SearchAudit identifies a large number of 
queries belonging to a different type of attack: forum 
spamming. 

• Facilitation of attack analysis: SearchAudit helps 
identify vulnerable Web sites that are targeted by 
attackers. In addition, SearchAudit helps analyze a 
series of phishing attacks that lasted for more than one 
year. These attacks set up more than 400 phishing 
domains, and tried to steal a large number of Windows 
Live Messenger user credentials. 


The rest of the paper is organized as follows. We 
start with reviewing related work in Section 2. Then 
we present the architecture of SearchAudit in Section 3. 
As SearchAudit contains two stages, Section 4 focuses 
on the results of the first stage—presenting the mali- 
cious queries identified, and verifying that they are in- 
deed malicious. Section 5 describes the second stage 
of SearchAudit—analyzing the correlation between ma- 
licious queries and other attacks. In this paper, we study 
three types of attacks in detail: searching for vulnerable 
Web sites (Section 6), forum spamming (Section 7), and 
Windows Live Messenger phishing attacks (Section 8). 
Finally we conclude in Section 9. 


2 Related Work 


There is a significant amount of automated Web traffic 
on the Internet [5]. A recent study by Yu et al. showed 
that more than 3% of the entire search traffic may be 
generated by stealthy search bots [25]. 

One natural question to ask is: what is the motivation 
of these search bots? While some search bots have legit- 
imate uses, e.g., by search engine competitors or third 
parties for studying search quality [8, 17], many others 
could be malicious. It is widely known that attackers 
conduct click fraud for monetary gain [7, 10]. 
Recently, researchers have associated malicious searches 
with other types of attacks. For example, Provos et 
al. reported that worms such as MyDoom.O and Santy 
used Web search to identify victims for spreading infec- 
tion [20]. Also, Moore et al. [16] identified four types of 
evil searches and showed that some Web sites were com- 
promised shortly after evil searches. They showed that 
attackers searched for keywords like “phpizabi v0.848b 
cl hfpl” to gather all the Web sites that have a known 
PHP vulnerability [9]. Subsequently these vulnerable 
Web servers were compromised to set up phishing pages. 

Besides email spamming and phishing, there are many 
other types of attacks, e.g., malware propagation and 
Denial of Service (DoS) attacks. Although there are a 
wealth of attack-detection approaches, most of these at- 
tacks were studied in isolation. Their correlations, espe- 
cially to Web searches, have not been extensively stud- 
ied. In this paper, we aim to take a step towards a system- 
atic framework to unveil the correlations between mali- 
cious searches and many other attacks. 

In SearchAudit, we derive regular expression patterns 
for matching malicious queries. There are many exist- 
ing signature-generation techniques for detecting worms 
and spam emails such as Polygraph [18], Hamsa [15], 
Autograph [12], Earlybird [21], Honeycomb [13], 
Neman [24], Vigilante [6], and AutoRE [23]. Some of these 
approaches are based on semantics, e.g., Neman and Vig- 
ilante, and hence are not suitable for us, since query 
strings do not have semantic information. The remaining 
content-based signature-generation schemes, 
Honeycomb, Polygraph, Hamsa, and AutoRE, can generate 
string tokens or regular expressions. These are more ap- 
pealing to us since attackers add random keywords to 
query strings, and we want the generated signatures to 
capture this polymorphism. In this work, we choose Au- 
toRE, which generates regular expression signatures. 

In [20], Provos et al. found malicious queries from the 
Santy worm by looking at search results. In those at- 
tacks, the attackers constantly changed the queries, but 
obtained similar search results (viz., the Web servers that 
are vulnerable to Santy’s attack). SearchAudit, on the 
other hand, is primarily targeted at finding new attacks, 
of which we have no prior knowledge. SearchAudit is 
thus a general framework to detect and understand ma- 
licious searches. While there might already be propri- 
etary approaches adopted by various search engines, or 
anecdotal evidence of malicious searches, we hope that 
our analysis results can provide useful information to the 
general research community. 


3 Architecture 


Our main goal is to let attackers be our guides—to follow 
their activities and predict their future attacks. We use a 
small-sized set of seed activities to bootstrap our system. 
The seed is usually limited and restricted to malicious 
searches of which we are aware. The system then applies 
a sequence of techniques to extend this seed set in order 
to identify previously unknown attacks and obtain a more 
comprehensive view of malicious search behavior. 

Figure 1 presents the architecture of our system. At 
a high level, the system can be viewed as having two 
stages. In the first stage, it examines search query logs, 
and expands the set of seed queries to generate additional 
sets of suspicious queries. This stage is automated and 
quite general, i.e., it can be used to find different types of 
suspicious queries pertaining to different malicious ac- 
tivities. The second stage involves the analysis of these 
suspicious queries to see how different attacks are con- 
nected with search—this is mostly done manually, since 
it requires a significant amount of domain knowledge to 
understand the behavior of the different malicious enti- 
ties. This section focuses on the first stage of our system 
and Sections 6, 7, and 8 provide examples of the analysis 
done in the second stage. 

Extending the seed using query logs appears to be a 
straightforward idea. Yet, there are two challenges. First, 
hackers do not always use the same queries; they mod- 
ify and change query terms over time in order to ob- 
tain different sets of search results, and thereby identify 
new victims. Therefore, simply using a blacklist of bad 
queries is not effective. Second, malicious searches may 
be mixed with normal user activities, especially on prox- 
ies. So we need to differentiate malicious queries from 
normal ones, though they may originate from the same 
machine or IP address. To address these challenges, we 
do not simply use the suspicious queries directly, but in- 
stead generate regular expression signatures from these 
suspicious queries. Regular expressions help us capture 
the structure of these malicious queries, which is nec- 
essary to identify future queries. We also filter regu- 
lar expressions that are too general and therefore match 
both malicious and normal queries. Using these two ap- 
proaches, the first stage of the system now consists of a 
pipeline of two steps: Query Expansion and Regular Ex- 
pression Generation. Since any set of malicious queries 
could potentially lead to additional ones, we loop back 
these queries until we reach a fixed point with respect to 
query expansion. The rest of this section presents each 
of the stages in detail. 


3.1 Query Expansion 


The first step in our system is to take a small set of seed 
queries and expand them. These seed queries are known 
to be suspicious or malicious. They could be obtained 
from a variety of sources, such as preliminary analysis of 
the search query logs or with the help of domain experts. 

Our search logs contain the following information: a 
query, the time at which the query was issued, the set of 
results returned to the searcher, and a few properties of 
the request, such as the IP address that issued the request 
and the user agent (which identifies the Web browser 
used). Since the amount of data in the search logs is mas- 
sive, we use the Dryad/DryadLINQ platform to process 
data in parallel on a cluster of hundreds of machines. 

The seed queries are expanded as follows. We run the 
seed queries through the search logs to find exact query 
matches. For each record where the queries match ex- 
actly, we extract the IP address that issued the query. We 
then go back to the search logs and extract all queries 
that were issued by this IP address. The reasoning here 
is that since this IP address issued a query that we believe 
to be malicious, it is probable that other queries from this 
IP address would also be malicious. This is because at- 
tackers typically issue not just a single query but rather 
multiple queries so as to get more search results. This 
method of expansion would allow us to capture the other 
queries issued. 

However, it must be noted that since we are using the 
IP address to expand to other queries, we need to be care- 
ful about dynamic IP addresses because of DHCP. In or- 
der to reduce the impact of dynamic IPs on our data, we 
consider only queries that were made on the same day as 
the seed query. 

At the end of this step, we have all the queries that 
were issued from suspicious IP addresses on the same 
day. 
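
The expansion step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LogRecord` type and its field names (`query`, `day`, `ip`) are hypothetical stand-ins for the search-log schema, and the in-memory list replaces the Dryad/DryadLINQ job used in the actual system.

```python
from collections import namedtuple
from datetime import date

# Hypothetical log record; the field names are assumptions, not the real schema.
LogRecord = namedtuple("LogRecord", ["query", "day", "ip"])


def expand_queries(seed_queries, log):
    """Return every query issued by a seed-matching IP on the same day."""
    seeds = set(seed_queries)
    # Step 1: find (IP, day) pairs that issued an exact match of a seed query.
    suspicious = {(r.ip, r.day) for r in log if r.query in seeds}
    # Step 2: collect all queries from those IPs, restricted to the same day
    # to limit the effect of dynamic (DHCP) addresses.
    return {r.query for r in log if (r.ip, r.day) in suspicious}


log = [
    LogRecord("inurl:vuln.php", date(2009, 2, 1), "10.0.0.1"),
    LogRecord("inurl:vuln.php site:cn", date(2009, 2, 1), "10.0.0.1"),
    LogRecord("weather", date(2009, 2, 2), "10.0.0.1"),  # same IP, other day
    LogRecord("weather", date(2009, 2, 1), "10.0.0.2"),  # other IP
]
expanded = expand_queries(["inurl:vuln.php"], log)
```

Keying on the (IP, day) pair rather than on the IP alone is what implements the same-day restriction.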
Algorithm 1: REGEX_CONSOLIDATE 

while $|V| > 0$ do 
    $R_{max} \leftarrow R_i$, where $R_i$ is the regular expression 
        that matches the largest number of strings in $V$ 
    $R \leftarrow R \cup \{R_{max}\}$ 
    $V \leftarrow V - \text{MATCHES}(V, R_{max})$ 
end while 
return $R$ 


Table 1: The number of search requests, unique queries, and 
IPs for different matching techniques on the February 2009 
dataset. 


Figure 1: The architecture of the system is a pipeline connecting the query expansion framework, the proxy elimination, and the regular expression generation. 


3.2 Regular Expression Generation 


The next step after performing query expansion is the 
generation of regular expressions. We prefer regular 
expressions over fixed strings for two reasons. First, they 
can potentially match malicious searches even if 
attackers change the search terms slightly. In our logs, we find 
that many hackers add restrictions to the query terms, 
e.g., adding “site:cn” will obtain search results in the 
.cn domain only; regular expressions can capture these 
variations of queries. Second, as many of the queries are 
generated using scripts, regular expressions can capture 
the structure of the queries and therefore can match 
future malicious queries. 


Signature Generation: We use a technique similar 
to AutoRE [23] to derive regular expressions, with a 
few modifications to incorporate additional information 
from the search domain, such as giving importance to 
word boundaries and special characters in a query. The 
regular-expression generator works as follows. First, it 
builds a suffix array to identify all popular keywords in 
the input set. Then it picks the most popular keyword 
and builds a root node that contains all the input strings 
matching this keyword. For the remaining strings, it 
repeats the process of selecting root nodes until all strings 
are selected. These root nodes are used to start building 
trees of frequent substrings. Then the regular-expression 
generator recursively processes each tree to form a forest. 
For each tree node, the keywords on the path to the root 
construct a pattern. It then checks the content between 
keywords and places restrictions on it (e.g., [0-9]{1,3} 
to constrain the intervening content to be one to three 
digits). In addition, for each regular expression, we 
compute a score that measures the likelihood that the regular 
expression would match a random string. This score is 
based on entropy analysis, as described in [23]; the lower 
the score, the more specific the regular expression. 
However, an overly specific regular expression would be 
equivalent to an exact match, and thus loses the 
benefit of using the regular expression in the first place. We 
therefore need a score threshold to pick the set of regular 
expressions in order to trade off between the specificity 
of the regular expression and the possibility of it 
matching too many benign queries. In SearchAudit, we select 
regular expressions with score lower than 0.6. 
(Parameter selection is discussed in detail in Section 4.2.) 


Eliminating Redundancies: One issue with the 
generated regular expressions is that some of them may be 
redundant, i.e., though not identical, they match the same 
or similar set of queries. For example, three input strings 
query site:A, query site:B, and query may 
generate two regular expressions query.{0,7} and 
query site:.{1}. The two regular expressions have 
different coverage and scores, but are both valid. In 
order to eliminate redundancy in regular expressions, we 
use the REGEX_CONSOLIDATE algorithm described in 
Algorithm 1. The algorithm takes as input $S$, the set of 
input queries, and $R_1, \ldots, R_n$, the regular expressions, and 
returns $R$, a subset of the input regular expressions. Here, 
the function MATCHES($S$, $R_i$) returns the strings $V \subseteq S$ 
that match the regular expression $R_i$. 


We note that REGEX_CONSOLIDATE is a greedy 
algorithm and does not return the minimal set of regular 
expressions required to match all the input strings. Finding 
the minimal set is in fact NP-Hard [4]. 
This ability to consolidate regular expressions has 
another advantage: if the input to the regular-expression 
generator contains too many strings, it is split into 
multiple groups, and regular expressions are generated for 
each group separately. These regular expressions can 
then be merged together using REGEX_CONSOLIDATE. 
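
The greedy consolidation can be sketched as follows. This is a minimal illustration rather than the authors' implementation. Two points are assumptions: a query counts as matched only if the pattern matches it in full (`re.fullmatch`), and the working set starts as every input string matched by at least one pattern.

```python
import re


def matches(strings, pattern):
    """MATCHES(S, R_i): the subset of strings that the pattern matches in full."""
    compiled = re.compile(pattern)
    return {s for s in strings if compiled.fullmatch(s)}


def regex_consolidate(strings, patterns):
    """Greedily keep the pattern that covers the most still-uncovered strings."""
    # Assumed initialization: V is every string matched by at least one pattern,
    # so each round is guaranteed to cover at least one string.
    uncovered = set()
    for pattern in patterns:
        uncovered |= matches(strings, pattern)

    selected = []
    while uncovered:
        best = max(patterns, key=lambda p: len(matches(uncovered, p)))
        selected.append(best)
        uncovered -= matches(uncovered, best)
    return selected


queries = ["query site:A", "query site:B", "query"]
kept = regex_consolidate(queries, [r"query.{0,7}", r"query site:.{1}"])
```

On the three example strings from the text, the first pattern covers all inputs, so the second is dropped as redundant. Because the selection is greedy, the result is not guaranteed to be the smallest possible set.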


Eliminating Proxies: We observe that we can speed 
up the generation of regular expressions by reducing the 
number of strings fed as input to the regular-expression 
generator. However, we would like to do this without 
sacrificing the quality of the regular expressions gener- 
ated. We observe in our experiments that some of the 
seed malicious queries are performed by IP addresses 
that correspond to public proxies or NATs. These IPs are 
characterized by a large query volume, since the same 
IP is used by multiple people. Also, most of the queries 
from these IPs are regular benign queries, interspersed 
with a few malicious ones. Therefore, eliminating these 
IPs would provide a quick and easy way of decreasing 
the number of input strings, while still leaving most of 
the malicious queries untouched. 

In order to detect such proxy-like IPs, we use a sim- 
ple heuristic called behavioral profiling. Most users in 
a geographical region have similar query patterns, which 
are different from that of an attacker. For proxies that 
have mostly legitimate users, their set of queries will 
have a large overlap with the popular queries from the 
same /16 IP prefix. We label an IP as a proxy if it issues 
more than 1000 queries in a day, and if the k most popular 
queries from that IP and the k most popular queries 
from that prefix overlap in m queries. (We empirically 
find k = 100 and m = 5 to work well.) Note however, 
that the proxy elimination is purely a performance opti- 
mization, and not necessary for the correct operation of 
SearchAudit. Behavioral profiling could also be replaced 
with a better technique for detecting legitimate proxies. 
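
A sketch of the behavioral-profiling heuristic is given below. The function names and the in-memory lists are hypothetical. "Overlap in m queries" is interpreted here as "at least m queries in common", which is an assumption about the intended comparison.

```python
from collections import Counter


def top_k(queries, k):
    """The set of the k most frequent queries in a list of issued queries."""
    return {q for q, _ in Counter(queries).most_common(k)}


def is_proxy(ip_queries, prefix_queries, k=100, m=5, min_volume=1000):
    """Label an IP as proxy-like from its daily volume and prefix overlap.

    ip_queries:     all queries issued by the IP in one day
    prefix_queries: all queries issued from the same /16 prefix that day
    """
    if len(ip_queries) <= min_volume:
        return False
    overlap = top_k(ip_queries, k) & top_k(prefix_queries, k)
    # Assumption: an overlap of at least m queries marks the IP as a proxy.
    return len(overlap) >= m


popular = ["news", "weather", "maps", "mail", "games", "music"]
prefix_day = popular * 500
proxy_day = popular * 200 + ["inurl:vuln.php"] * 3
attacker_day = ["inurl:vuln.php?id=%d" % i for i in range(1200)]

proxy_flag = is_proxy(proxy_day, prefix_day)
attacker_flag = is_proxy(attacker_day, prefix_day)
```

In this toy data the high-volume IP whose popular queries mirror its prefix is flagged, while the high-volume IP issuing only scripted queries is not.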


Looping Back Queries: Once the regular expressions 
are generated, they are applied to the search logs in order 
to extract all queries that match the regular expressions. 
This is an enlarged set of suspicious queries. These 
[Figure 2 plot; x-axis: Regex Threshold] 


Figure 2: Selecting the threshold for regular expression scores: 
for regular expressions having score 0.6 or less, nearly all the 
matched queries have new cookies. 


queries generated by SearchAudit can now be fed back 
into the system as new seed queries for another itera- 
tion. A discussion on the effect of looping back queries 
as seeds, and its benefits, is presented in Section 4.3.3. 
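
The loopback can be pictured as a fixed-point iteration. In the sketch below, `run_stage_one` is a hypothetical callable standing in for query expansion plus regular-expression matching, and `toy_stage_one` is an invented stand-in used only to make the example runnable.

```python
def loop_back(seed_queries, run_stage_one, max_rounds=None):
    """Feed stage-one output back in as seeds until no new queries appear."""
    known = set(seed_queries)
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        found = set(run_stage_one(known))
        if found <= known:  # fixed point: this round added nothing new
            break
        known |= found
        rounds += 1
    return known, rounds


# Toy stand-in: each query "leads to" a fixed set of related queries.
toy_links = {"a": {"b"}, "b": {"c"}, "c": set()}


def toy_stage_one(seeds):
    out = set(seeds)
    for q in seeds:
        out |= toy_links.get(q, set())
    return out


result, rounds = loop_back({"a"}, toy_stage_one)
capped, capped_rounds = loop_back({"a"}, toy_stage_one, max_rounds=1)
```

The `max_rounds` cap models stopping early when further iterations are not worth their cost.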


4 Stage One Results 


We apply SearchAudit to several months of search logs 
in order to identify malicious searchers. In this section, 
we first describe the data collection and system setup. 
Then we explain the process of parameter selection. Fi- 
nally, we present the detection results and verify the re- 
sults. 


4.1 Data Description and System Setup 


We use three months of search logs from the Bing search 
engine for our study: February 2009 (when it was known 
as Live Search), December 2009, and January 2010. 
Each month of sampled data contains around 2 billion 
pageviews. Each pageview records all the activities re- 
lated to a search result page, including information such 
as the query terms, the links clicked, the query IP ad- 
dress, the cookie, the user agent, and the referral URL. 
Because of privacy concerns, the cookie and the user 
agent fields are anonymized by hashing. 

The seed malicious queries are obtained from a hacker 
Web site milw0rm.com [1]. We crawl the site and 
extract 500 malicious queries, which were posted between 
May 2006 and August 2009. 




We implement SearchAudit on the Dryad/DryadLINQ 
platform, where data is processed in parallel on a clus- 
ter of 240 machines. The entire process of SearchAudit 
takes about 7 hours to process the 1.2 TB of sampled 
data. 


4.2 Selection of Regular Expressions 


As described in Section 3.2, we can eliminate proxies 
to speed up the regular expression generation. If we do 
not eliminate proxies, the input to the regular-expression 
generator can contain queries from the proxies, and there 
may be many benign queries among them. As a result, al- 
though some of the generated regular expressions may be 
specific, they could match benign queries. In this setting, 
we need to examine each regular expression individu- 
ally, and select those that match only malicious queries. 
To do this, we use the presence of old cookies to guide 
us. We observe that if we pick a random set of search 
queries (which may contain a mix of normal and 
malicious queries), the fraction of new cookies in them is 
quite low. However, for the known malicious queries 
(the seed queries), it is close to 100%, because most au- 
tomated traffic either does not enable cookies or presents 
invalid cookies. (In both these cases, a new cookie is 
created by the search engine and assigned to the search 
request.) Of course, cookie presence is just one feature 
of regular user queries. We can use other features as well, 
as discussed in Section 4.5. 

If proxies are eliminated, the remaining queries are 
from the attackers’ IPs, and we find that most of them are 
malicious. In this case, we can simply use a threshold 
to pick regular expressions based on their scores. This 
threshold represents a trade-off between the specificity of 
the regular expression and the possibility of it being too 
general and matching too many random queries. Again, 
we use the number of new cookies as a metric to guide us 
in our threshold selection. Figure 2 shows the relation- 
ship between the regular expression score and the per- 
centage of new cookies in the queries matched by the 
regular expressions. We see empirically that expressions 
with scores lower than 0.6 have a very high fraction of 
new cookies (> 99.85%), similar to what we observe 
with the seed malicious queries. On the other hand, regu- 
lar expressions with score greater than 0.6 match queries 
where the fraction of new cookies is similar to what we 
see for a random sampling of user queries; therefore it 
is plausible that these regular expressions mostly match 
random queries that are not necessarily malicious. 
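
The threshold selection can be illustrated with a small sketch. The record layout (a dict with a boolean `new_cookie` field), the sample patterns, and their scores are invented for illustration. Only the 0.6 cutoff comes from the text.

```python
def new_cookie_fraction(records):
    """Fraction of matched requests whose cookie was newly issued."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r["new_cookie"]) / len(records)


def select_regexes(scored_matches, threshold=0.6):
    """Keep regular expressions whose score is lower than the threshold."""
    return [regex for regex, score, _ in scored_matches if score < threshold]


# (pattern, score, requests matched by the pattern); all values hypothetical.
scored = [
    (r"inurl:vuln\.php.{0,8}", 0.2,
     [{"new_cookie": True}, {"new_cookie": True}]),
    (r".*news.*", 0.9,
     [{"new_cookie": False}, {"new_cookie": True},
      {"new_cookie": False}, {"new_cookie": False}]),
]
fractions = {regex: new_cookie_fraction(recs) for regex, _, recs in scored}
kept = select_regexes(scored)
```

Plotting `fractions` against the score for every generated expression gives the kind of curve from which a cutoff such as 0.6 can be read off.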

In our tests, proxy elimination filters most of the be- 
nign queries, but less than 3% of the unique malicious 
queries (using cookie-age as the indicator). Therefore 
it has little effect on the generated regular expressions. 
Consequently, all the results presented in the paper are 
Seed Queries Used          Coverage 
100 queries (pre-2009)     100% 
Random 50%                 98.50% 
Random 25%                 88.50% 


Table 2: Malicious query coverage obtained when using 
different subsets of the seed queries. 


with the use of proxy elimination. We choose 0.6 as the 
regular expression threshold, and this ends up picking 
about 20% of the generated regular expressions. 


4.3 Detection Results 


We now present results obtained from running 
SearchAudit, and show how each component con- 
tributes to the end results. 


4.3.1 Effect of Query Expansion and Regular Expression Matching 


We feed the 500 malicious queries obtained from 
milw0rm.com into SearchAudit, and examine the 
February 2009 dataset. Using exact string match, we 
find that 122 of the 500 queries appear in the dataset, and 
we identify 174 IP addresses that issued these queries. 
Many of these queries are submitted from multiple IP 
addresses and many times, presumably to fetch multi- 
ple pages of search results. In all, there are 122,529 such 
queries issued by these IP addresses to the search engine. 
Then we use the query expansion module together with 
the proxy elimination module of SearchAudit and obtain 
800 unique queries from 264 IP addresses. Finally we 
run these queries through the regular expression genera- 
tion engine. 

Table 1 quantifies the number of additional queries 
SearchAudit identifies by the use of query expansion 
and regular expression generation. Using regular expres- 
sion matching, SearchAudit identifies 3,560 distinct ma- 
licious queries from 1001 IP addresses. Compared to 
exact matching of the seed queries, regular-expression- 
based matching increases the number of unique queries 
found by almost a factor of 30. We also find 4 times more 
attacker IPs. Thus using regular expressions for match- 
ing provides significant gains. 


4.3.2 Effect of Incomplete Seeds 


Seed queries are inherently incomplete, since they are a 
very small set of known malicious queries. In this sec- 
tion, we look at how much coverage SearchAudit contin- 
ues to get when the number of seed queries is decreased. 

First, we split the 122 seed queries into two sets: 100 
queries that were first posted on milw0rm.com before 
Setting        IPs       Queries      % Queries with valid cookies 
No loopback    1,001     297,181      0.15% 
Loopback 1     39,969    8,992,839    0.87% 
Loopback 2     40,318    9,001,737    0.96% 
Loopback 3     41,301    9,028,143    0.97% 


Table 3: The number of IPs and queries captured by SearchAudit 
in the February 2009 dataset, with and without looping back. 


2009, and the remaining 22 that were posted in 2009. We 
then use the 100 queries as our seed, and run SearchAudit 
on the same search log for a week in February 2009. We 
find that the queries generated by SearchAudit recover 
all the 122 seed queries. Therefore SearchAudit is ef- 
fective in finding the malicious queries even before they 
are posted on the Web site; in fact we find queries in the 
search logs several months before they are first posted on 
the Web site. 

Next, we choose a random subset of the original seed 
queries. With 50% of the randomly selected seed queries, 
our coverage is 98.5% out of the 122 input seed queries; 
and using just 25% of the seed queries, we can obtain 
88.5% of the queries. These results are summarized in 
Table 2. 


4.3.3 Looping Back Seed Queries 


After SearchAudit is bootstrapped using malicious 
queries, it uses the derived regular expressions to gen- 
erate a steady stream of queries that are being performed 
by attackers. SearchAudit uses these as new seeds to gen- 
erate additional suspicious queries. Each such set of sus- 
picious queries can subsequently be fed back as new seed 
input to SearchAudit, until the system reaches a fixed 
point, or until the cost of another iteration outweighs the 
marginal benefit of finding more such queries. 

To measure when this fixed point would occur, we use 
the February 2009 dataset, and run SearchAudit multiple 
times, each time taking the output from the previous run 
as the seed input. For the first run, we use the 500 seed 
queries obtained from milw0rm.com. 

Table 3 summarizes our findings. We see that, as ex- 
pected, the number of queries captured increases when 
the generated queries are looped back as new seeds. 
Also, the number of queries that have valid cookies re- 
mains quite small throughout (< 1%), suggesting that 
the new queries generated through the loopback are sim- 
ilar to the seed queries and the queries generated in the 
first round. We observe that looping back once signifi- 
cantly increases the set of queries and IPs captured (from 
1001 IPs to almost 40,000 IPs), but subsequent iterations 
do not add much information. 

Therefore, we restrict SearchAudit to loop back the 
generated queries as seeds exactly once. 
Dataset     IPs       Total Queries    Uniq. Queries 
Feb-2009    39,969    8,992,839        542,505 
Dec-2009    29,364    5,824,212        3,955,244 
Jan-2010    42,833    2,846,703        422,301 


Table 4: The number of search requests, unique queries, and 
IPs captured by SearchAudit in the different datasets. 


4.3.4 Overall Matching Statistics 


Putting it all together, i.e., using regular expression 
matching and loopback, Table 4 shows the number of 
IPs, total queries, and distinct queries that SearchAudit 
identifies in each of the datasets. Overall, SearchAu- 
dit identifies over 40,000 IPs issuing more than 4 mil- 
lion malicious queries, resulting in over 17 million 
pageviews. One interesting point to note here is the sig- 
nificant spike in the number of unique queries found in 
the December dataset. The reason for this spike is the 
presence of a set of attacker IPs that do not fetch multiple 
result pages for a query, but instead generate new queries 
by adding a random dictionary word to the query, thereby 
increasing the number of distinct queries we observe. 
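
As a quick consistency check on these totals, the per-dataset rows of Table 4 can be summed. This assumes the overall figures are simple sums over the three months; the text does not state how they were aggregated.

```latex
\begin{align*}
\text{total queries} &= 8{,}992{,}839 + 5{,}824{,}212 + 2{,}846{,}703 = 17{,}663{,}754,\\
\text{unique queries} &\le 542{,}505 + 3{,}955{,}244 + 422{,}301 = 4{,}920{,}050.
\end{align*}
```

The first line agrees with the figure of over 17 million pageviews. The second is only an upper bound, since the same query may appear in more than one month.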


4.4 Verification of Malicious Queries 


Next, we verify that the queries identified by SearchAu- 
dit are indeed malicious queries. As we lack ground truth 
information about whether a query is malicious or not, 
we adopt two approaches. The first is to check whether 
the query is reported on any hacker Web sites or secu- 
rity bulletins. The second is to check query behavior— 
whether the query matches individual bot or botnet fea- 
tures. 

For each query q returned by SearchAudit, we issue a 
query “q AND (dork OR vulnerability)” to the search 
engine, and save the results. Here, the term “dork” is used 
by attackers to represent malicious searches. We add the 
terms “dork” and “vulnerability” to the query to help us 
find forums and Web sites that discuss these queries. We 
then look at the most popular domains appearing in the 
search results across multiple queries. Domains that list 
a large number of malicious searches from our set are 
likely to be security forums, blogs by security companies 
or researchers, or even hacker Web sites. These can now 
be used as new sources for finding more seed queries. 
We manually examine 50 of these Web sites, and find that 
around 60% of them are security blogs or advisories. The 
remaining 40% are in fact hacker forums. In all, 73% of 
the queries reported by SearchAudit contain search re- 
sults associated with these 50 Web sites. 
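
The domain-ranking step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names, the input dictionary, and the choice to count each domain once per query are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlparse


def verification_query(q):
    """Build the verification query for a suspicious query q."""
    return f"{q} AND (dork OR vulnerability)"


def top_domains(results_by_query, n=50):
    """Rank domains by the number of distinct queries whose results list them.

    results_by_query is a hypothetical input: it maps each suspicious query
    to the result URLs returned for its verification query.
    """
    counts = Counter()
    for urls in results_by_query.values():
        # Count each domain once per query, so that one long result list
        # cannot dominate the ranking.
        counts.update({urlparse(u).netloc.lower() for u in urls})
    return counts.most_common(n)
```

Domains near the top of this ranking are the candidates for manual inspection as security forums, advisories, or hacker Web sites.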

Next we look at two sets of behavioral features that 
would indicate whether the query is automated, and 
whether a set of queries was generated by the same 
script. The first set of features applies to individual 
bot-generated queries, e.g., not clicking any link. They 
indicate whether a query is likely to be scripted or not. The 
second set of features relates to botnet group properties. 
In particular, they quantify the likelihood that the differ- 
ent queries captured by a particular regular expression 
were generated by the same (or similar) script. 

Note that although these behavior features could dis- 
tinguish bot queries from human-generated ones, they 
are not robust features because attackers can easily use 
randomization or change their behavior if they know 
these features. In this work, we use these behavior fea- 
tures only for validation rather than relying on them to 
detect malicious queries. 


4.4.1 Verification of Queries Generated by Individual Bots 


To distinguish bot queries from those generated by hu- 
man users, we select the following features: 


• Cookie: This is the cookie presented in the search 
request. Most bot queries do not enable cookies, resulting 
in an empty cookie field. For normal users who 
do not clear their cookies, all the queries carry the old 
cookies. 


• Link clicked: This records whether any link in the 
search results was clicked by the user. Many bots do 
not click any link on the result page. Instead, they 
scrape the results off the page. 


We compare queries returned by SearchAudit with 
queries issued by normal users for popular terms such 
as facebook and craigslist. Table 5 and Table 6 
show the comparison results. We see that for SearchAu- 
dit returned queries, 98.8% of them disable cookies, as 
opposed to normal users, where only 2.7% disable cook- 
ies. We also see that on average, all the queries in a group 
returned by SearchAudit had no links clicked. On the 
other hand, for normal users, over 85% of the searches 
resulted in clicks. All these common features suggest 
that the queries returned by SearchAudit are highly likely 
to be automated or scripted searches, rather than being 
submitted by regular users. 
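
As an illustration, the two per-query features can be summarized for a set of queries as shown below. This is a sketch under stated assumptions: the record layout (a dict with the hypothetical keys `cookie` and `clicked`) is invented for the example and does not reflect the actual log format.

```python
def feature_fractions(queries):
    """Return the fraction of queries with no cookie and with no click.

    Each query is a dict with the hypothetical keys 'cookie' (a string that
    is empty when cookies are disabled) and 'clicked' (a bool).
    """
    total = len(queries)
    if total == 0:
        return {"cookie_disabled": 0.0, "no_click": 0.0}
    no_cookie = sum(1 for q in queries if not q["cookie"])
    no_click = sum(1 for q in queries if not q["clicked"])
    return {"cookie_disabled": no_cookie / total, "no_click": no_click / total}
```

Running the same function over a SearchAudit query group and over a sample of normal-user queries gives the kind of side-by-side comparison reported in Table 5 and Table 6.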


4.4.2 Verification of Queries Generated by Botnets 


Having shown that individual queries identified by 
SearchAudit display bot characteristics, we next study 
whether a set of queries matched by a regular expression 
are likely to be generated by the same script, and hence 
the same attacker (or botnet). For all the queries matched 
by a regular expression, we look at the behavior of each 
IP address that issued the queries. If most of the IP ad- 
dresses that issued these queries exhibit similar behavior, 
then it is likely that all these IPs were running the same 
script. We pick the following four features that are 
representative of querying behavior: 


• User agent: This string contains information about the 
browser and the version used. 


• Metadata: This field records certain metadata that 
comes with the request, e.g., where the search was 
issued from. 


Some botnets use a fixed user agent string or metadata, 
or choose from a set of common values. For each group, 
we check the percentage of IP addresses that have identi- 
cal values or identical behavior, e.g., changing value for 
each request. If over 90% of the IPs show similar behav- 
ior, we infer that IPs in this group might have used the 
same script. 


• Pages per query: This records the number of search 
result pages retrieved per query. 


• Inter-query interval: This denotes the time between 
queries issued by the same IP. 


Queries generated by the same script may retrieve a 
similar number of result pages per query or have a simi- 
lar inter-query interval. For these two features, we com- 
pute median value for each IP address and then check 
whether there is only a small spread in this value across 
IP addresses (< 20%). This allows us to infer whether 
the different IPs follow the same distribution, and so be- 
long to the same group. 
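
The four group-level checks can be sketched as follows. This is a hedged illustration rather than the procedure actually used: the text does not define "spread", so the sketch assumes standard deviation divided by mean of the per-IP medians, and it only tests for identical values (it does not model the "changing value for each request" behavior). All names and the input layout are hypothetical.

```python
from collections import Counter
from statistics import mean, median, pstdev


def share_identical(values_by_ip, threshold=0.90):
    """True if more than `threshold` of the IPs share the most common value."""
    if not values_by_ip:
        return False
    top_count = Counter(values_by_ip.values()).most_common(1)[0][1]
    return top_count / len(values_by_ip) > threshold


def small_spread(samples_by_ip, max_spread=0.20):
    """True if the per-IP medians vary by less than `max_spread`.

    Spread is assumed to be the standard deviation over the mean of the
    per-IP medians; the text does not specify the exact measure.
    """
    medians = [median(s) for s in samples_by_ip.values() if s]
    if len(medians) < 2 or mean(medians) == 0:
        return False
    return pstdev(medians) / mean(medians) < max_spread


def matched_features(group):
    """Count how many of the four botnet features a group matches."""
    checks = [
        share_identical(group["user_agent"]),
        share_identical(group["metadata"]),
        small_spread(group["pages_per_query"]),
        small_spread(group["inter_query_interval"]),
    ]
    return sum(checks)
```

The count returned by `matched_features` corresponds to the per-group number of matched botnet features that is summed up for each regular expression group.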

Table 7 and Table 8 show the comparison between ma- 
licious queries and regular query groups. We see that 
for query groups returned by SearchAudit, a significant 
fraction of the queries agree on the metadata feature. For 
regular users, one usually observes a wide distribution of 
metadata. We see a similar trend in the user-agent string 
as well. For regular users, the user-agent strings rarely 
match, while for suspicious queries, more than half of 
them share the same user-agent string. With respect to 
the number of pages retrieved per search query, we see 
that regular users typically take only the first page re- 
turned. On the other hand, groups captured by SearchAu- 
dit fetch on average around 15 pages per query. This 
varies quite a bit across groups, with many groups fetch- 
ing as few as 5 pages per query, and several groups fetch- 
ing as many as 100 pages for a single query. 

The average inter-query interval for normal users is 
over 2.5 hours between successive queries. On the other 
hand, the average inter-query interval for bot queries is 
only 7 seconds, with most of the attackers submitting the 
queries every second or two. A few stealthy attackers 
repeated search queries at a much slower rate of once 
every 3 minutes. 

For each regular expression group, we sum up the botnet 
features that it matches. Figure 3 shows the distribution. 
A majority (87%) of the groups have at least one similar 
botnet feature and 69% of them have two or more features, 
suggesting that the queries captured by SearchAudit are 
probably generated by the same script. 


Field                     Fraction of Queries within a Group with Same Value 
Cookie enabled = false    87.50% 
Link clicked = false      99.90% 

Table 5: The fraction of search queries within each regular expression group agreeing on the value of each field. 


Feature                   Fraction of Queries within a Group with Same Value 
Inter-query interval 

Table 7: The fraction of search queries within each SearchAudit regular expression group agreeing on botnet features. 


Figure 3: Graph showing the fraction of regular expressions that match one or more botnet features. 


4.5 Discussion 


Network security can be an arms race and the generated 
regular expressions can become obsolete [20]. However, 
we believe that the signature-based approach is still a vi- 
able solution, especially if we have good seed queries. In 
the paper, we show that even a few hundred seed queries 
can help identify millions of malicious queries. In ad- 
dition, SearchAudit can also identify new hackers’ fo- 
rums or security bulletins that can be used as additional 
sources for seed queries. As long as there are a few IP 
addresses participating in different types of attacks, the 
query expansion framework of SearchAudit can be used 
to follow attackers and capture new attacks. 

With the publication of the SearchAudit framework, 
attackers may try to work around the system and hide 
their activities. Attackers may try to mix the malicious 
searches with normal user traffic to trick SearchAudit into 
concluding that they are using proxy IP addresses. This 
is hard because behavior profiling requires attackers to 
submit queries that are location sensitive and also time 
sensitive. As many attackers use botnets to hide themselves, 
their IP addresses are usually spread all over the 
world, making it a challenging task to come up with normal 
user queries in all regions. In addition, as we mentioned 
in Section 3, proxy elimination is an optimization 
and it can be disabled. In such settings, both the normal 
queries and malicious queries can generate regular expressions. 
But the regular expressions of normal queries 
will be discarded because they match many other queries 
from normal users. 


USENIX Association 


Field                     Fraction of Queries within a Group with Same Value 
Cookie enabled = false    2.70% 
Link clicked = false      14.23% 

Table 6: The fraction of search queries by normal users agreeing on the value of each field. 


Feature                   Fraction of Queries within a Group with Same Value 
                          4.02% 
Metadata                  21.8 
Inter-query interval      9275.5 seconds 

Table 8: The fraction of search queries by normal users agreeing on botnet features. 


Attackers may also try to add randomness to the 
queries to escape regular expression generation. The reg- 
ular expression engine looks at frequently occurring key- 
words to form the basis of the regular expression. There- 
fore, even if one attacker can manage not to reuse key- 
words for multiple queries, he has no control over other 
attackers using a similar query with the same keyword. 
An attacker may also simply avoid using a keyword, but 
since the query needs to be meaningful in order to get 
relevant search results, this approach would not work. 


In this work, we use the presence of old cookies to 
help us choose regular expressions that are more likely 
to be malicious; old cookies are a feature associated with 
normal benign users. We use the cookies as a marker for 
normal users because it is very simple, and works well 
in practice. If the attackers evolve and start to use old 
cookies, possibly by hijacking accounts of benign users, 
we can rely on other features such as the presence of a 
real browser, long user history, actual clicking of search 
results, or other attributes such as user credentials. 
Even if a particular attacker is very careful and manages 
to escape detection, if other attackers are less careful 
and use similar queries and get caught by SearchAudit, 
the careful attacker should still be found. 


5 Stage 2: Analysis of Detection Results 


In this section, we move on to the second stage of 
SearchAudit: analyzing malicious queries and using 
search to study the correlation between attacks. 

The detected suspicious queries were submitted from 
more than 42,000 IP addresses across the globe. Large 
countries such as USA, Russia, and China are respon- 
sible for almost half the IPs issuing malicious queries. 
Looking at the number of queries issued from each IP, 
we find a large skew: 10% of the IPs are responsible for 
90% of the queries. 

SearchAudit generates around 200 regular expressions. 
Table 9 lists ten example regular expressions, 
ordered by their scores. As we can see, the lower the 
score, the more specific the regular expression is. The 
last one, .{1,25}comment.{2,21}, is an example of a 
discarded regular expression, with a score of 0.78. It is very 
generic (searching for the string comment only) and hence 
may cause many false positives. 

By inspecting the generated regular expressions and 
the corresponding query results, we identify two asso- 
ciated attacks: finding vulnerable Web sites and forum 
spamming. We describe them next. 


Vulnerable Web sites: When searching for vulnerable 
servers, attackers predominantly adopt two approaches: 


1. They search within the structure of URLs to find 
ones that take particular arguments. For example, 


index.php?content=[^?=#+@;&:]{1,10} 


searches for Web sites that are generated by PHP 
scripts and take arguments (content=). Attackers 
then try to exploit these Web sites by using specially 
crafted arguments to check whether they have pop- 
ular vulnerabilities like SQL injection. 


2. They perform malicious searches that are targeted, 
focusing on particular software with known vulnerabilities. 
We see many malicious queries that start with 
"Powered by" followed by the name of the software 
and version number, searching for known vulnerabilities 
in some version of that software. 
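
To make the role of such expressions concrete, the sketch below applies two signatures of this kind to incoming queries. It is illustrative only: it assumes the bracketed character class is the negated set `[^?=#+@;&:]`, it assumes a signature must match the whole query string, and the function name is hypothetical.

```python
import re

# Two example signatures in the style of the generated regular expressions.
PATTERNS = [
    re.compile(r'index\.php\?content=[^?=#+@;&:]{1,10}'),
    re.compile(r'"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}'),
]


def is_suspicious(query):
    """True if any signature matches the entire query string."""
    return any(p.fullmatch(query) for p in PATTERNS)
```

For instance, `is_suspicious('index.php?content=news')` is true under these assumptions, while an ordinary query that merely mentions `index.php` is not matched.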


Forum spamming: The second category of malicious 
searches are those that do not try to compromise Web 
sites. Instead, they are aimed towards performing certain 
actions on the Web sites that are generated by a particular 
piece of software. The most common goal is Web spamming, 
which includes spamming on blogs and forums. 
For example, the regular expression 
"/includes/joomla.php" site:.[a-zA-Z]{2,3} 
searches for blogs generated by the Joomla software. 
Attackers may have scripts to post spam to such blogs or 
forums. 


Windows Live Messenger phishing: Besides iden- 
tifying malicious searches generated by attackers, 
SearchAudit is also useful to study malicious searches 
triggered by normal users. In April 2009, we noticed in 
our search logs a large number of queries with the key- 
word party, generated by a series of Windows Live 
Messenger phishing attacks [25]. We see these queries 
because the users are redirected by the phishing Web 
site to pages containing the search results for the query. 
Since the queries are triggered by normal users compro- 
mised by the attack, expanding the queries by IP address 
will not gain us any information. In this case we use 
SearchAudit only to generate regular expressions to de- 
tect this series of phishing attacks. 

In the next three sections, we study these three attacks 
(compromise of vulnerable Web sites, forum spamming, 
and Windows Live Messenger phishing) in detail. We 
aim to answer questions such as how do attackers lever- 
age malicious searches for launching other attacks, how 
do attacks propagate and at what scale do they operate, 
and how can the results of SearchAudit be used to better 
understand and perhaps stop these attacks in their early 
stages. 


6 Attack 1: Identifying Vulnerable Web 
Sites 


As vulnerable Web sites are typically used to host phish- 
ing pages and malware, we start with a brief overview 
of phishing and malware attacks before describing how 
malicious searches can help find vulnerable Web sites. 


6.1 Background of Phishing/Malware Attacks 


A typical phishing attack starts with an attacker search- 
ing for vulnerable servers by either crawling the Web, 
probing random IP addresses, or searching the Web with 
the help of search engines. After identifying a vulner- 
able server and compromising it, the attacker can host 
malware and phishing pages on this server. Next, the 
attacker advertises the URL of the phishing or malware 
page through spam or other means. Finally, if users are 
tricked into visiting the compromised server, the attacker 
can conduct cyber crimes such as stealing user creden- 
tials and infecting computers. 
Regular Expression | Score 

"/includes/joomla\.php" site:\.[a-zA-Z]{2,3} 
"/includes/class item\.php" site: [^?=#+@;&:]{2,4} 
"php-nuke" site: [^?=#+@;&:]{2,4} 
"modules\.php\?op=modload" site:\.[a-zA-Z0-9]{2,6} 
"[^?=#+@;&:]{0,1}index\.php\?content=[^?=#+@;&:]{1,10} 
"powered by xoopsgallery" [^?=#+@;&:]{0,23}site: [a-zA-Z]{2,3} 
"[^?=#+@;&:]{0,12}\?page=shop\.browse".{0,9} 
.{0,8}index\.php\?option=com_.{3,17} 
[^?=#+@;&:]{0,3}webcalendar v1\..{3,17} 

.{1,25}comment.{2,21} 

Table 9: Example regular expressions and their scores. The last row is an example of a regular expression that is not selected because it is not specific enough. 


Currently, phishing and malware detection happens 
only after the attack is live, e.g., when an anti-spam 
product identifies the URLs in the spam email, when a 
browser captures the phishing content, or when anti-virus 
software detects the malware or virus. Once detected, 
the URL is added to anti-phishing blacklists. However, it 
is highly likely that some users may have already fallen 
victim to the phishing scam by the time the blacklists are 
updated. 


6.2 Applications of Vulnerability Searches 


With SearchAudit, we can potentially detect phishing/malware 
attacks at the very first stage, when the attacker 
is searching for vulnerabilities. We might even 
proactively prevent servers from getting compromised. 
To obtain the list of vulnerable Web sites, we sample 
5,000 queries returned by SearchAudit. For every query 
q, we issue a query “q -dork -vulnerability” to the search 
engine and record the returned URLs. Here we explicitly 
exclude the terms “dork” and “vulnerability” because 
we do not want results that point to security forums or 
hacker Web sites that discuss and post the vulnerability 
and the associated “dork”. Using this approach, we obtain 
80,490 URLs from 39,475 unique Web sites. 
Ideally, we would like to demonstrate that most of 
these Web sites are vulnerable. Since there does not 
exist a complete list of vulnerable Web sites to compare 
against, we use several methods for our validation. 
First, we compare this list and a list of random Web sites 
against a list of known phishing or malware sites, and 
show that the sites returned by SearchAudit are more 
likely to appear in phishing or malware blacklists. Second, 
we test and show that many of these sites indeed 
have SQL injection vulnerabilities. 


Figure 4: The fraction of search results that were present in 
phishing/malware feeds for each query. 


6.2.1 Comparison Against Known Phishing and 
Malware Sites 


For the potentially vulnerable Web sites obtained from 
the malicious queries, we check the presence of these 
URLs in known anti-malware and anti-phishing feeds. 
We use two blacklists: one obtained from PhishTank [2] 
and the other from Microsoft. In addition, we submit 
these queries to the search engine again at the time of 
our experiments in order to obtain the latest results. 

In both cases, the results are similar: 3-4% of the do- 
mains listed in the search results of malicious queries are 
in the anti-phishing blacklists, and 1.5% of them are in 
the anti-malware blacklist. In total, around 5% of the 
domains appear in one or more blacklists. This is signif- 
icantly higher than other classes of Web sites we consid- 
ered. 
Not all malicious queries may be equally good at finding 
vulnerable servers. Figure 4 shows the distribution 
of compromised search results across queries. For the 
top 10% of the queries, at least 15% of the search results 
appear in the blacklists. 


6.2.2 SQL Injection Vulnerabilities 


Next, we show that a subset of these Web sites do indeed 
have vulnerabilities. Given that SQL injection is a popu- 
lar attack, since many Web sites use database backends, 
we test for SQL vulnerabilities. 

The best way to prove that a server has SQL injec- 
tion vulnerabilities would be to actually compromise the 
server; however, we were not comfortable with doing 
this. Instead, we limit ourselves to checking if the in- 
puts appear to be sanitized by performing the following 
study. For the malicious queries, we look at the search 
results and crawl all of the links twice. For each link, the 
first time we crawl the link as is, and the second time we 
add a single quote (’) to the first argument to test whether 
the server sanitizes the argument correctly. Note that we 
only consider URLs that take an argument. We then compare 
the Web pages obtained from the successive crawls. If 
the two pages are identical, then it suggests that the input 
arguments are being properly sanitized, so there is 
no obvious SQL injection vulnerability. However, if the 
pages are different, it does not necessarily mean that the 
input is not being sanitized; the difference could just be 
an advertisement that changes with each access. Instead, we 
look at the diff between the two pages, and check whether 
the second page contains any kind of SQL error. If there 
is an SQL error in the second page, but not in the first, it 
shows that the input string is not being filtered properly. 
While the presence of unsanitized inputs does not guar- 
antee SQL injection vulnerabilities, it is nevertheless a 
strong indicator. 
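
The two-crawl check can be sketched as follows. This is a minimal sketch, not the crawler used in the study: fetching is left out (the two page bodies are passed in as strings), the function names are hypothetical, and the list of error markers is a small illustrative sample of common database error messages rather than a complete set.

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative, non-exhaustive fragments of common database error messages.
SQL_ERROR_MARKERS = (
    "you have an error in your sql syntax",   # MySQL
    "unclosed quotation mark",                # Microsoft SQL Server
    "unterminated quoted string",             # PostgreSQL
)


def quote_first_argument(url):
    """Append a single quote to the value of the first query argument.

    Returns None when the URL carries no argument, since such URLs are
    outside the scope of the check.
    """
    parts = urlsplit(url)
    if not parts.query:
        return None
    first, sep, rest = parts.query.partition("&")
    return urlunsplit(parts._replace(query=first + "'" + sep + rest))


def has_sql_error(page):
    """True if the page body contains one of the known error fragments."""
    text = page.lower()
    return any(marker in text for marker in SQL_ERROR_MARKERS)


def looks_unsanitized(original_page, probed_page):
    """True if an SQL error appears only in the page fetched with the quote."""
    if original_page == probed_page:
        return False  # identical pages suggest the input is sanitized
    return has_sql_error(probed_page) and not has_sql_error(original_page)
```

Requiring the error to be absent from the first page is what keeps pages that differ only in rotating content, such as advertisements, from being flagged.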

We examine a sample of 14,500 URLs obtained from 
the results of malicious queries, and find that 1,760 URLs 
(12%) do not sanitize the input strings and therefore may 
be vulnerable to SQL injection. Note that this is a conser- 
vative estimate since these URLs only account for Web 
sites that take arguments in the URL. Other Web sites 
that take POST arguments or have input forms on their 
pages could also be susceptible to SQL injection attacks. 


7 Attack 2: Forum-Spamming Attacks 


Using the seed queries from milw0rm (which were for 
the purpose of finding vulnerable Web sites), SearchAu- 
dit additionally identifies forum-spamming attacks. In 
this section, we study the forum-spamming searches in 
detail. 
Dataset          Forum-Searching IPs    Total Searches 
February 2009    22,466                 5,828,704 
December 2009    20,309                 1,130,337 
January 2010     31,071                 567,445 

Table 11: Stats on forum-searching IPs. 


7.1 Attack Process 


Forum spamming is an effective way to deliver spam 
messages to a large audience. In addition, it may be used 
as a technique to boost the page rank of Web sites. To do 
so, spammers insert the URL of the target Web site that 
they want to promote in a spam message. By posting 
the message in many online forums, the target Web site 
would have a high in-degree of links, possibly resulting 
in a high page rank. 

While there are several studies on the effect of forum 
spamming [19, 22], this section focuses on exploring the 
ways spammers perform forum spamming. In particu- 
lar, we show how they discover a large number of forum 
pages in the first place. 

Table 10 shows a few example forum-related queries 
captured by SearchAudit. There are two types of 
queries: the first being general, like “post a new topic”, 
and the second being more specific, tailored for a particular 
piece of software. For example, “"UBBCode:" JoomlaComment” 
searches for pages generated by 
the JoomlaComment software. For both types of queries, 
random keywords are added to increase the search cov- 
erage. The randomness is especially useful if spammers 
use botnets, as each bot will get different query results 
and they can focus on spamming different forums in par- 
allel. 


7.2 Attack Scale 


From the regular expressions generated by SearchAudit, 
we manually identified 46 regular expressions that are 
associated with forum spamming. Using these regular 
expressions, we proceeded to study the matched queries 
and IP addresses. Table 11 shows that the number of IPs 
used for forum searching stayed quite constant in 2009, 
but in 2010, the number of IP addresses increased by 
50%. 

Most IPs have transient behavior. Comparing the 
IPs in December 2009 to those in January 2010, only 
3115 (10-15%) IPs overlap. This shows that the forum- 
spamming hosts either change frequently, or may reside 
on dynamic IP ranges and hence their IPs change over 
time. Both these possibilities suggest that they are likely 
to be botnet hosts. In fact, when we apply the group 
similarity tests to check botnet behavior (defined in 
Section 4.4.2), all forum groups have at least one group 
similarity feature. 
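
One reading of the 10-15% overlap range is that it reflects the two possible denominators: 3115 / 20,309 ≈ 15.3% of the December 2009 IPs and 3115 / 31,071 ≈ 10.0% of the January 2010 IPs. The sketch below computes both shares for two IP sets; the function name is hypothetical and both inputs are assumed to be non-empty.

```python
def overlap(ips_a, ips_b):
    """Return the number of shared IPs and their share of each dataset.

    Both inputs are assumed to be non-empty collections of IP strings.
    """
    a, b = set(ips_a), set(ips_b)
    shared = a & b
    return len(shared), len(shared) / len(a), len(shared) / len(b)
```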
Regular Expression | Group Similarity Features | Targeted Forum Generation Software 

[^?=#+@;&:]{2,7} "Commenta" JoomlaComment -""#R# 
[^?=#+@;&:]{6,11} "ips, inc" 
[^?=#+@;&:]{1,8} "Message:" photogallery#R# 
[^?=#+@;&:]{1,9} "Be first to comment this article" akocomment#R# 
[^?=#+@;&:]{1,6} "UBBCode:" JoomlaComment -""#R# 
[^?=#+@;&:]{1,8} "The comments are owned by the poster\. We aren't responsible for their content\." sections#R# | PHP-Nuke, Xoops, etc. 
[a-zA-Z]{4,12} post new topic | 1028 | phpBB, Gallery, etc. 
[^?=#+@;&:]{5,13} Board Statistics.{0,10} | Invision Power Board (IP.Board), MyBB, etc. 

Table 10: Example regular expressions related to forum searches, their scale, and the targeted forum generation software. 


Figure 5: CDF of the distribution of queries among IPs based 
on the query volume. 


It is interesting to note that, although the number of 
IPs increased, the total number of queries decreased. As 
shown in Figure 5, IPs are becoming more stealthy. In 
February 2009, more than 80% of forum queries 
originated from very aggressive IPs that submitted thousands 
of queries per IP. Those IPs could be spammers’ 
own dedicated machines. In January 2010, less than 20% of 
forum queries came from aggressive IPs. The majority of 
the queries came from IPs that search at a low rate. 
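
The share of queries coming from aggressive IPs can be measured with a few lines of code. The sketch below is illustrative: the input is assumed to be one IP string per logged query, and the default cutoff of 1,000 queries per IP is an assumption standing in for "thousands of queries per IP".

```python
from collections import Counter


def share_from_aggressive_ips(query_ips, min_queries=1000):
    """Fraction of all queries issued by IPs with at least min_queries queries.

    query_ips holds one IP string per logged query (hypothetical format).
    """
    per_ip = Counter(query_ips)
    total = sum(per_ip.values())
    if total == 0:
        return 0.0
    heavy = sum(count for count in per_ip.values() if count >= min_queries)
    return heavy / total
```

Computing this value per dataset gives a single number that tracks the shift from a few high-volume IPs to many low-rate ones.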


7.3 Applications of Forum-Searching Queries 


Knowledge of forum-searching IPs and query search 
terms can be used to help filter forum spam. After a malicious 
search, we can follow the search result pages to 
clean up the spam posts. More aggressively, even before 
the malicious search, by recognizing the malicious query 
terms or the malicious IP addresses, search engines could 
refuse to return results to the spammers. Web servers 
could also refuse connections from IPs that are known to 
search for forums. 


Figure 6: Fraction of IP addresses appearing in the Project 
Honey Pot list vs. the forum group size. 

We validate the forum-spamming IPs using Project 
Honey Pot [3]. Project Honey Pot is a distributed hon- 
eypot network that aims to identify Web spamming. Par- 
ticipating Web sites embed a piece of software that dy- 
namically generates a page containing a different email 
address for each HTTP request. Requests are recorded 
and the generated email addresses are also monitored. 
If later they receive emails (which must be spam, since 
these email addresses are unused), Project Honey Pot 
will know which IP addresses obtained those email 
addresses, and which IP addresses sent the spam emails. 

Around 12% of the forum searching IPs found 
by SearchAudit were captured by Project Honey Pot. 
In contrast, among IP addresses that conduct normal 
queries such as craigslist, only 0.5% of them were 
listed. This shows that the captured forum searching IPs 
have a much higher chance of being caught spamming 
than the IP addresses of normal users. 

Figure 6 plots the matching percentages of different 
regular expression groups related to forum searching. We 
can see that, across different groups, the percentages of 
forum IPs that appeared in Project Honey Pot are all 
significant. This suggests that most of the forum-spamming 
groups are involved in email address scraping as well. 
For the largest forum-spamming group, which has 9125 
IP addresses, more than 30% of the IP addresses 
appeared in Project Honey Pot. It is possible that the 
remaining 70% are also associated with spamming, but 
they could have targeted Web sites that are not part of 
the Project Honey Pot network, and are hence not captured. 
Hence, the analysis of search logs complements Project Honey Pot. 
It offers a unique view that allows us to observe all the 
IP addresses conducting forum searches, while Project 
Honey Pot allows us to see what the attackers do after 
performing the searches. 


8 Attack 3: Windows Live Messenger 
Phishing Attacks 


In this section, we study a series of Windows Live Mes- 
senger phishing attacks. The queries were not issued by 
attackers directly. Rather, they were triggered by normal 
users. In this section, we use SearchAudit to generate 
regular expressions and study this series of attacks. 


8.1 Attack Process 


The scheme of these phishing attacks operates as fol- 
lows: 


1. The victim (say Alice) receives a message from one 
of her contacts, asking her to check out some party 
pictures, with a link to one of the phishing sites. 


2. Alice clicks the link and is taken to the Web page 
that looks very similar to the legitimate Windows 
Live Messenger login screen and asks her to enter 
her messenger credentials. Alice enters her creden- 
tials. 


3. Alice is now taken to a page 
http://<domain-name>.com?user=alice, 
which redirects to image search results from a 
search engine (in this case, Bing) for party. 


4. The attackers now have Alice’s credentials. They 
log in to Alice’s account and send a similar message 
to her friends to further propagate the attack. 


Figure 7: The rate at which new users were compromised by 
the phishing attack. 

Figure 8: The timeline of how different domain names were 
used during the phishing attack. All lines of the same color 
correspond to the same IP address. 


We believe there are two reasons why the attackers use 
a search engine here. First, using images from a search 
engine is less likely to tip the victim off than if the images 
were hosted on a random server. Second, the attackers do 
not need to host the image Web pages themselves, and 
can thus offload the cost of hosting to the search engine 
servers. 


8.2 Attack Scale 


Since this attack generated search traffic that contains the 
keyword party, we feed this keyword as the seed query 
into SearchAudit. Since all the queries of this attack are 
identical or similar, we modify SearchAudit to focus on 
the query referral field, which records the source of traf- 
fic redirection. SearchAudit generates two regular ex- 
pressions from the query referral field: 

1. http://[a-zA-Z0-9._]*.<domain-name>/ 
2. http://<domain-name>?user=[a-zA-Z0-9._]* 


In the second regular expression, the pattern 
[a-zA-Z0-9._]* may seem like a random set of let- 
ters and numbers, but it actually describes usernames. 
In our example attack scenario, when Alice is redirected 
to the image search results, the HTTP-referrer is set to 
http://<domain-name>.com?user=alice. Using 
this information, we can identify the set of users whose 
credentials may have been compromised. 
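
Extracting the affected usernames from the referral field can be sketched as follows. This is an illustration under stated assumptions: the function name and the example domain are hypothetical, the referrer list stands in for the query referral field of the search log, and the sketch requires at least one username character (`+` rather than `*`) so that empty values are skipped.

```python
import re


def compromised_users(referrers, domain):
    """Collect the usernames carried in referrers of one phishing domain.

    `domain` is a known phishing domain, e.g. the hypothetical
    'phish.example'; `referrers` is an iterable of referrer strings.
    """
    pattern = re.compile(
        r"http://" + re.escape(domain) + r"/?\?user=([a-zA-Z0-9._]+)"
    )
    users = set()
    for ref in referrers:
        match = pattern.fullmatch(ref)
        if match:
            users.add(match.group(1))
    return users
```

Returning a set means a user who was redirected several times is counted once.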

Using these regular expressions, SearchAudit identi- 
fies a large number of unique user names in the log col- 
lected from May 2008 to July 2009. Figure 7 shows the 
cumulative fraction of users compromised by this attack 
over time. When the attack first started, there was an ex- 
ponential growth phase, similar to other worm or virus 
breakouts. This phase ended around day 50, when most 
of the domains got blacklisted (see Figure 8). This attack 
then transitioned into a steady increase phase, until day 250 
when it broke out again. 

There are over 400 unique phishing domain names as- 
sociated with this attack. The top domains targeted more 
than 10° users. Around one third of the domains phished 
fewer than 100 users each. These domains were the ones 
that were quickly blacklisted. Figure 8 plots the timeline 
of how different domains were used over time. For read- 
ability, the plot contains only the top domains (out of the 
total 400 domains) that were responsible for compromis- 
ing at least 1000 users. The figure plots the domains on 
the Y-axis, and the days on which that domain was active 
on the X-axis. Each horizontal line corresponds to the 
set of days a particular domain was seen in our search 
log. The different colors correspond to the different IP 
addresses on which the Web pages were hosted. We ob- 
serve that though there were over 180 domain names in 
circulation, they were all hosted on only a dozen differ- 
ent IP addresses. It can also be seen that multiple do- 
main names were associated with an IP address at the 
same time. Therefore, it is not the case that a new do- 
main name was registered and used only after an older 
one was blocked. 


8.3 Characteristics of Compromised Accounts


We find that the compromised accounts had a large num- 
ber of short login sessions (lasting less than one minute). 
These short login sessions were initiated from IPs in sev- 
eral different /24 subnets. Figure 9 shows the compar- 
ison between the short logins from multiple subnets for 
compromised users and for the other users. We see that 
for typical users, 99% of the short logins happened from 
fewer than 4 different subnets. However, for the compro- 
mised users, we see that more than 50% had short logins 
from 15 or more different subnets.

Figure 9: Number of different /24 subnets from which short
logins happen.
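The feature plotted in Figure 9 can be sketched in Java as follows. The one-minute cutoff and the /24 grouping come from the text; the `Login` type, the method names, and the idea of passing the subnet threshold as a parameter are assumptions made for this illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ShortLoginSubnets {
    /** One login session: IPv4 source address and session length in seconds. */
    static final class Login {
        final String ipv4;
        final long durationSeconds;

        Login(String ipv4, long durationSeconds) {
            this.ipv4 = ipv4;
            this.durationSeconds = durationSeconds;
        }
    }

    /** Maps a dotted-quad address such as "203.0.113.7" to its /24 prefix "203.0.113". */
    static String subnet24(String ipv4) {
        return ipv4.substring(0, ipv4.lastIndexOf('.'));
    }

    /** Counts the distinct /24 subnets used by logins shorter than one minute. */
    static int distinctShortLoginSubnets(List<Login> logins) {
        Set<String> subnets = new HashSet<>();
        for (Login login : logins) {
            if (login.durationSeconds < 60) {
                subnets.add(subnet24(login.ipv4));
            }
        }
        return subnets.size();
    }

    /** Flags an account whose short logins span at least subnetThreshold subnets. */
    static boolean looksSuspicious(List<Login> logins, int subnetThreshold) {
        return distinctShortLoginSubnets(logins) >= subnetThreshold;
    }

    public static void main(String[] args) {
        List<Login> logins = List.of(
                new Login("203.0.113.7", 12),
                new Login("203.0.113.99", 30),  // same /24 as the first login
                new Login("198.51.100.4", 45),
                new Login("192.0.2.1", 3600));  // long session, ignored
        System.out.println(distinctShortLoginSubnets(logins)); // prints 2
    }
}
```

A threshold between the two populations in Figure 9 (for example 15, where more than half of the compromised users fall) would be a natural starting point, although the text does not state that such a cutoff was used.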


We also observe that many of the short logins came 
from IPs which were located in Hong Kong. Given 
that the phishing sites were also mostly located in Hong 
Kong, the attackers might have resources in Hong Kong, 
where they logged in to the compromised accounts and 
sent messages to spread the phishing attacks. 

Using these characteristics, we can then look back at 
the login patterns of all Windows Live Messenger users 
to identify more user accounts with similar suspicious 
login patterns, thus enabling us to take remedial actions 
for protecting a larger number of compromised users. 


9 Conclusion 


In this paper we present SearchAudit, a framework to 
identify malicious Web searches. By taking just a small 
number of known malicious queries as seed, SearchAudit
can identify millions of malicious queries and thousands
of vulnerable Web sites. Our analysis shows that
the identification of malicious searches can help detect 
and prevent large-scale attacks, such as forum spamming 
and Windows Live Messenger phishing attacks. More 
broadly, our findings highlight the importance of ana- 
lyzing search logs and studying correlations between the 
various attacks enabled by malicious searches.


Abstract


Web applications are the most common way to make ser- 
vices and data available on the Internet. Unfortunately, 
with the increase in the number and complexity of these 
applications, there has also been an increase in the num- 
ber and complexity of vulnerabilities. Current techniques 
to identify security problems in web applications have 
mostly focused on input validation flaws, such as cross- 
site scripting and SQL injection, with much less attention 
devoted to application logic vulnerabilities. 

Application logic vulnerabilities are an important class 
of defects that are the result of faulty application logic. 
These vulnerabilities are specific to the functionality of 
particular web applications, and, thus, they are extremely 
difficult to characterize and identify. In this paper, we 
propose a first step toward the automated detection of 
application logic vulnerabilities. To this end, we first use 
dynamic analysis and observe the normal operation of a 
web application to infer a simple set of behavioral spe- 
cifications. Then, leveraging the knowledge about the 
typical execution paradigm of web applications, we filter 
the learned specifications to reduce false positives, and 
we use model checking over symbolic input to identify 
program paths that are likely to violate these specifica- 
tions under specific conditions, indicating the presence 
of a certain type of web application logic flaws. We de- 
veloped a tool, called Waler, based on our ideas, and 
we applied it to a number of web applications, finding 
previously-unknown logic vulnerabilities. 


1 Introduction 


Web applications have become the most common means 
to provide services on the Internet. They are used 
for mission-critical tasks and frequently handle sensi- 
tive user data. Unfortunately, web applications are often 
implemented by developers with limited security skills, 
who often have to deal with time-to-market pressure and
financial constraints. As a result, the number of web
application vulnerabilities has increased sharply. This is
reflected in the Symantec Global Internet Security Threat
Report, which was published in April 2009 [12]. The re- 
port states that, in 2008, web vulnerabilities accounted 
for 63% of the total number of vulnerabilities reported. 

Most recent research on vulnerability analysis for web 
applications has focused on the identification and miti- 
gation of input validation flaws. This class of vulnera- 
bilities is characterized by the fact that a web application 
uses external input as part of a sensitive operation with- 
out first checking or sanitizing it properly. Prominent 
examples of input validation flaws are cross-site script- 
ing (XSS) [20] and SQL injection vulnerabilities [3, 32]. 
With XSS, an application sends to a client output that is
not sufficiently checked. This allows an attacker to in- 
ject malicious JavaScript code into the output, which is 
then executed on the client’s browser. In the case of SQL 
injection, an attacker provides malicious input that alters 
the intended meaning of a database query. 

One reason for the prior focus on input validation vul- 
nerabilities is that it is possible to provide a concise and 
general specification that captures the essential charac- 
teristics of these vulnerabilities. That is, given a pro- 
gramming environment, it is possible to specify a set of 
functions that read inputs (called sources), a set of func- 
tions that represent security-sensitive operations (called 
sinks), and a set of functions that check data for mali- 
cious content. Then, various static and dynamic anal- 
ysis techniques can be used to ensure that there are no 
unchecked data flows from sources to sinks. Since the 
specification of input validation flaws is independent of 
the application logic, once a detection system is avail- 
able, it can be used to find bugs in many applications. 

While it is important to identify and correct input vali- 
dation flaws, they represent only a subset of the spectrum 
of (web application) vulnerabilities. In this paper, we
explore another type of application flaw. In particular, we
look at vulnerabilities that result from errors in the logic
of a web application. Such errors are typically specific
to a particular web application, and might be domain- 
specific. For example, consider an online store web ap- 
plication that allows users to use coupons to obtain a dis- 
count on certain items. In principle, a coupon can be 
used only once, but an error in the implementation of the 
application allows an attacker to apply a coupon an arbi- 
trary number of times, reducing the price to zero. 
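To make the coupon example concrete, the following Java sketch contrasts a flawed and an intended implementation. All names (`CouponCart`, `applyCouponFlawed`, `applyCouponOnce`) and the cent-based pricing are hypothetical and chosen only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class CouponCart {
    private int totalCents;
    private final Set<String> usedCoupons = new HashSet<>();

    CouponCart(int totalCents) {
        this.totalCents = totalCents;
    }

    /** Flawed logic: nothing records that this coupon was already applied. */
    void applyCouponFlawed(String code, int discountCents) {
        totalCents = Math.max(0, totalCents - discountCents);
    }

    /** Intended logic: each coupon code is honored at most once. */
    boolean applyCouponOnce(String code, int discountCents) {
        if (!usedCoupons.add(code)) {
            return false; // coupon was already used on this cart
        }
        totalCents = Math.max(0, totalCents - discountCents);
        return true;
    }

    int totalCents() {
        return totalCents;
    }

    public static void main(String[] args) {
        CouponCart cart = new CouponCart(1000);
        for (int i = 0; i < 4; i++) {
            cart.applyCouponFlawed("SAVE3", 300); // replaying the same coupon
        }
        System.out.println(cart.totalCents()); // prints 0
    }
}
```

Both methods validate their input correctly in the usual sense; the flaw is visible only to someone who knows the rule that a coupon may be used once.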

So far, web application logic flaws have received little 
attention, and their treatment is limited to informal dis- 
cussions (a well-known example is the white paper by J. 
Grossman [14]). This is due to the fact that logic vulnera- 
bilities are specific to the intended functionality of a web 
application. Therefore, it is difficult (if not impossible)
to define a general specification that allows for the dis- 
covery of logic vulnerabilities in different applications. 

One possible approach would be to leverage an appli- 
cation’s requirement specification and design documents 
to identify parts of the implementation that do not respect 
the intended behavior of the application. Unfortunately, 
these documents are almost never available in the case of 
web applications. Therefore, other means to characterize 
the expected behavior of a web application must be found
for detection of application logic flaws. 

In this paper, we take a first step toward the automated 
detection of application logic vulnerabilities. Our ap- 
proach operates in two steps. In the first step, we infer 
specifications that (partially) capture a web application’s 
logic. These specifications are in the form of likely in- 
variants, which are derived by analyzing the dynamic ex- 
ecution traces of the web application during normal oper- 
ation. The intuition is that the observed, normal behavior 
allows one to model properties that are likely intended by 
the programmer. This step is necessary to automatically 
obtain specifications that reflect the business logic of a 
particular web application. In the second step, we ana- 
lyze the inferred specifications with respect to the web 
application’s code and identify violations. 

The current implementation of our approach is based 
on two well-known analysis techniques, namely, dy- 
namic execution to extract (likely) program invariants 
and model checking to identify specification violations. 
However, to the best of our knowledge, the way in which 
we combine these two techniques is novel, has never 
been applied to web applications, and has not been lever- 
aged to detect application logic flaws. Moreover, we had 
to significantly extend the existing techniques to capture 
specific characteristics of web applications and to scale 
them to real-world applications as outlined below. 

In the first step of our analysis, we used a well-known 
dynamic analysis tool [9, 11] to infer program specifica- 
tions in the form of likely invariants. We extended the 
existing general technique to be more targeted to the
execution of web applications. In particular, we addressed
two main shortcomings of the general approach: the fact
that many invariants that relate to important concepts of 
web applications were not identified (e.g., invariants re- 
lated to objects that are part of the user session) and the 
fact that many spurious invariants were generated as a re- 
sult of the limited coverage of the dynamic analysis step 
or because of artifacts in the analyzed inputs. 


To deal with spurious invariants, we developed two 
novel techniques to identify which derived invariants re- 
flect real (or “true”) program specifications. The first 
one uses the presence of explicit program checks, in- 
volving the variable(s) constrained by an invariant, as a 
clue that the invariant is indeed relevant to the behav- 
ior of the web application. The second one is based on 
the idea that certain types of invariants are intrinsically 
more likely to reflect the intent of the programmer. In 
particular, we focus on invariants that relate external in- 
puts to the contents of user sessions and the back-end 
database. The use of these techniques to filter the derived 
invariants allows for a more effective extraction of speci- 
fication of a web application’s behavior, when compared 
to previously-proposed approaches that accept all gener- 
ated likely invariants as correctly reflecting the behavior 
of a program. 


In the second step of the analysis, we use model check- 
ing over symbolic input to analyze the inferred specifica- 
tions with respect to the web application’s code and to 
identify which real invariants can be violated. We had to 
extend existing model checking tools with new mecha- 
nisms to take into account the unique characteristics of 
web applications. These characteristics include the fact 
that web applications are composed of modules that can 
be invoked in any order and that the state of the web 
application must also take into account the contents of 
back-end databases and other session-related storage fa- 
cilities. 

By following the two steps outlined above, it is possi- 
ble to automatically detect a certain subclass of applica- 
tion logic flaws, in which an application has inconsistent 
behavior with respect to security-sensitive functionality. 
Note that our approach is neither sound nor complete, 
and, therefore, it is prone to both false positives and false 
negatives. However, we implemented our approach in 
a prototype tool, called Waler, that is able to automati- 
cally identify logic flaws in web applications based on 
Java servlets. We applied our tool to several real-world 
web applications and to a number of student projects, and 
we were able to identify many previously-unknown web 
application logic flaws. Therefore, even though our tech- 
nique cannot detect all possible logic flaws and our tool 
is currently limited to servlet-based web applications, we 
believe that this is a promising first step towards the
automated identification of logic flaws in web applications.

In summary, this paper makes the following contributions:


• We extend existing dynamic analysis techniques to
  derive program invariants for a class of web
  applications, taking into account their particular
  execution paradigm.

• We identify novel techniques for the identification
  of invariants that are “real” with high probability
  and likely associated with the security-relevant
  behavior of a web application, pruning a large number
  of spurious invariants.

• We extend existing model checking techniques to
  take into account the characteristics of web
  applications. Using this approach, we are able to
  identify the occurrence of two classes of web
  application logic flaws.

• We implemented our ideas in a tool, called Waler,
  and we used it to analyze a number of servlet-based
  web applications, identifying previously-unknown
  application logic flaws.


2 Web Application Logic Vulnerabilities 


Web application vulnerabilities can be divided into two 
main categories, depending on how a vulnerability can be 
detected: (1) vulnerabilities that have common character- 
istics across different applications and (2) vulnerabilities 
that are application-specific. Well-known vulnerabilities 
such as XSS and SQL injection belong to the first cate- 
gory. These two vulnerabilities are characterized by the 
fact that a web application uses external input as part of a 
sensitive operation without first checking or sanitizing it. 
Vulnerabilities of the second type (such as, for example, 
failures of the application to check for proper user autho- 
rization or for the correct prices of the items in a shop- 
ping cart) require some knowledge about the application 
logic in order to be characterized and identified. In this 
paper, we focus on this second type of vulnerabilities, 
and we call them web application logic vulnerabilities. 

To detect web application logic vulnerabilities auto- 
matically, one needs to provide the detection tool with a 
specification of the application’s intended behavior. Un- 
fortunately, these specifications, whether formal or infor- 
mal, are rarely available. Therefore, in this work, we pro- 
pose an automated way to detect application logic vul- 
nerabilities that do not require the specification of the 
web application behavior to be available. Our intuition is 
that often the application code contains “clues” about the 
behavior that the developer intended to enforce. These 
“clues” are expressed in the form of constraints on the 
values of variables and on the order of the operations
performed by the application.

There are many ways in which constraints can be
implemented in an application. In this work, we focus on
two concrete types of constraints. The first (and most
intuitive) way to encode application-specific constraints is
in the form of program checks (i.e., if-statements). The
presence of such a check in the program before certain 
data or functionality is accessed often represents a “clue” 
that either the range of the allowed input should be lim- 
ited or that an access to an item is limited. The absence of 
a similar check on an alternate program path to the same 
program point might represent a vulnerability. For ex- 
ample, vulnerabilities like authentication bypass, where 
an attacker is able to invoke a privileged operation with- 
out having to provide the necessary credentials, could be 
detected using this approach. 

The second type of constraints, which often exist in 
web applications, is the implicit correlation between the 
data stored in back-end databases and the data stored in 
user sessions. More specifically, in web applications, 
databases are often used to store persistent data, and user 
sessions are used to store the most accessed parts of this 
data (such as user credentials). Thus, there often exist 
implicit constraints on what is currently stored in the user 
session when a database query is issued. A “clue,” in 
this case, is an explicit relation between session data and 
database data. Certain application logic vulnerabilities, 
like unauthorized editing of a post belonging to another 
user, can be detected if a path where these relations are 
violated is found. More detailed examples of this type of
vulnerability are provided in Section 4.3.2.
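As a small illustration of such a session/database relation, the Java sketch below models two paths to the same update. The in-memory maps stand in for a back-end table, and every name in it (`PostEditor`, `editChecked`, `editUnchecked`) is hypothetical; it is not code from any of the analyzed applications.

```java
import java.util.HashMap;
import java.util.Map;

class PostEditor {
    // Stand-ins for a back-end table: post id -> owner and post id -> body.
    private final Map<Integer, String> postOwner = new HashMap<>();
    private final Map<Integer, String> postBody = new HashMap<>();

    void createPost(int id, String owner, String body) {
        postOwner.put(id, owner);
        postBody.put(id, body);
    }

    /** Path that respects the implicit relation: session user == post owner. */
    boolean editChecked(String sessionUser, int id, String newBody) {
        if (!sessionUser.equals(postOwner.get(id))) {
            return false;
        }
        postBody.put(id, newBody);
        return true;
    }

    /** Alternate path to the same update that omits the ownership check. */
    boolean editUnchecked(String sessionUser, int id, String newBody) {
        if (!postOwner.containsKey(id)) {
            return false;
        }
        postBody.put(id, newBody); // relation between session and database is not enforced
        return true;
    }

    String body(int id) {
        return postBody.get(id);
    }

    public static void main(String[] args) {
        PostEditor editor = new PostEditor();
        editor.createPost(1, "alice", "hello");
        System.out.println(editor.editChecked("mallory", 1, "defaced"));   // prints false
        System.out.println(editor.editUnchecked("mallory", 1, "defaced")); // prints true
    }
}
```

During normal use only owners edit their posts, so the equality between the session user and the post owner holds in every observed trace; the unchecked path is where that relation can be violated.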


3 Detection Approach 


Based on the discussions in the previous section, it is 
clear that an analysis tool that aims to detect web appli- 
cation logic vulnerabilities requires a specification of ex- 
pected behavior of the program that should be checked. 
If such specifications are available (e.g., in the form of 
formal specifications or unit testing procedures), they can 
be leveraged to validate the behavior of the application’s 
implementation. However, in many cases there is no spe- 
cification of the expected behavior of a web application. 
In these cases, we need a way to derive it in an automated 
fashion. 

A number of techniques have been proposed by various
researchers to derive program specifications
automatically. However, regardless of the approach used, none
of them can derive a complete specification without hu- 
man feedback. To overcome this problem, we propose to 
use one of the existing dynamic techniques to derive par- 
tial program specifications and use an additional analysis 
step to refine the results and find vulnerabilities. 

In particular, we observe that web applications are
typically exercised by users in a way that is consistent with
the intentions of the developers. More specifically, users
usually browse the application by following the provided 
links and filling out forms with expected input. These 
program paths are usually well-tested for normal input. 
As a result, when monitoring a web application whose 
“regular” functionality is exercised, it is possible to infer 
interesting relationships between variables, constraints 
on inputs and outputs, and the order in which the applica- 
tion’s components are invoked. This information can be 
used to extract specifications that partially characterize 
the intended behavior of the web application. 

As a result, in our approach, we use an initial dynamic
step where we monitor the execution of a web applica- 
tion when it operates on a number of normal inputs. In 
this step, it is important to exercise the application func- 
tionality in a way that is consistent with the intentions of 
the developer, i.e., by following the provided links and 
submitting reasonable input. Note that the information 
about a web application’s “normal” behavior cannot be 
gathered using automatic-crawling tools, as these tools 
usually do not interact with an application following the 
workflow intended by the developer or using inputs that 
reflect normal operational patterns. 

In this work, as the result of the dynamic analysis 
step, we infer partial program specifications in the form 
of likely invariants. These invariants capture constraints 
on the values of variables at different program points, 
as well as relationships between variables. For exam- 
ple, we might infer that the Boolean variable isAdmin 
must be true whenever a certain (privileged) function 
is invoked. As another example, the analysis might de- 
termine that the variable freeShipping is true only 
when the number of items in the shopping cart is greater 
than 5. We believe that these invariants provide a good 
base for the detection of logic flaws because they often 
capture application-specific constraints that the program- 
mer had in mind when developing the web application. 
Of course, it is unlikely that the set of inferred invari- 
ants represents a complete (or precise) specification of a 
web application’s functionality. Nevertheless, it provides 
a good, initial step to obtain a model of the intended be- 
havior of a program and can be used to guide further, 
more elaborate program analysis. 
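The idea of a likely invariant can be sketched in a few lines of Java: a candidate property is kept only if it holds on every observed sample. This is a deliberately simplified illustration and not Daikon's algorithm or API; the `Sample` type, the candidate set, and the method names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class LikelyInvariants {
    /** Values observed at one program point during one execution (hypothetical shape). */
    static final class Sample {
        final boolean isAdmin;
        final boolean freeShipping;
        final int itemsInCart;

        Sample(boolean isAdmin, boolean freeShipping, int itemsInCart) {
            this.isAdmin = isAdmin;
            this.freeShipping = freeShipping;
            this.itemsInCart = itemsInCart;
        }
    }

    /** Candidate properties; the first two mirror the examples in the text. */
    static Map<String, Predicate<Sample>> candidates() {
        Map<String, Predicate<Sample>> m = new LinkedHashMap<>();
        m.put("isAdmin == true", s -> s.isAdmin);
        m.put("freeShipping ==> itemsInCart > 5", s -> !s.freeShipping || s.itemsInCart > 5);
        m.put("itemsInCart < 10", s -> s.itemsInCart < 10);
        return m;
    }

    /** Keeps only the candidates that hold on every sample in the trace. */
    static List<String> infer(Map<String, Predicate<Sample>> candidates, List<Sample> trace) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Predicate<Sample>> c : candidates.entrySet()) {
            if (trace.stream().allMatch(c.getValue())) {
                kept.add(c.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Sample> trace = List.of(
                new Sample(true, false, 2),
                new Sample(true, true, 7));
        System.out.println(infer(candidates(), trace));
    }
}
```

With only the two samples in `main`, the candidate `itemsInCart < 10` also survives, although nothing suggests the programmer intended it. Adding one run with a larger cart removes it, which is why limited input variety yields spurious invariants.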

As the second step of the analysis, we use model 
checking with symbolic inputs to check the inferred spe- 
cifications. The goal is to find additional evidence in 
the code about which invariants are likely to be part of 
the real program specification and then to identify paths 
where these invariants are violated. 

A naive approach would assume that all the generated 
invariants represent real invariants (specifications) for an 
application. Unfortunately, this straightforward solution 
leads to an unacceptably large number of false positives. 
The reason is the incompleteness of the dynamic analysis
step. In particular, the limited variety of the input data
frequently leads to the discovery of spurious invariants 
that do not reflect the intended program specification. To 
address this problem, we propose two novel techniques 
to distinguish between spurious and real program invari- 
ants. 

The first technique aims to distinguish between a spu- 
rious and a true invariant by determining whether a pro- 
gram contains a check that involves the variables con- 
tained in the invariant on a path leading to the pro- 
gram point for which this likely invariant was gener- 
ated. A check on a variable is a control flow operation 
that constrains this variable on a path. For example, the 
if-statement if (isAdmin == true) {...} repre- 
sents a check on the variable isAdmin. Intuitively, we 
assume that a certain invariant was intended by a pro- 
grammer if there is at least one program path that con- 
tains checks that enforce the correctness of this invariant 
(i.e., the checks imply that the invariant holds). We call 
such invariants supported invariants. When we find a 
supported invariant that can be violated on an alterna- 
tive program path leading to the same program point, we 
report this as a potential application logic vulnerability. 
When a likely invariant can be violated, but there are no 
checks in the program that are related to this invariant, 
then we consider it to be spurious. 

The second technique identifies a certain type of in- 
variant that we always consider to reflect actual program 
specifications. These invariants represent equality rela- 
tions between web application state variables (in partic- 
ular, variables storing the content of user sessions and 
database contents). Relationships of that kind often re- 
flect important internal consistency constraints in a web 
application and are rarely coincidental. A vulnerability 
is reported when the analysis determines that the equal- 
ity relation is not enforced on all paths. 
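The two filtering rules can be summarized as a small decision function. The Java sketch below is one reading of the text under stated assumptions: the enum names, the `NOTHING_TO_REPORT` outcome for invariants that cannot be violated, and the boolean inputs (which in practice would come from the model checking step) are all hypothetical.

```java
class InvariantTriage {
    /** STATE_EQUALITY: equality between session contents and database contents. */
    enum Kind { GENERAL, STATE_EQUALITY }

    enum Verdict { NOTHING_TO_REPORT, POTENTIAL_VULNERABILITY, SPURIOUS }

    /**
     * @param kind             category of the likely invariant
     * @param supportedByCheck true if some path to the program point contains
     *                         checks that imply the invariant
     * @param violable         true if some path to the same program point
     *                         violates the invariant
     */
    static Verdict classify(Kind kind, boolean supportedByCheck, boolean violable) {
        if (!violable) {
            return Verdict.NOTHING_TO_REPORT;
        }
        // Second technique: state-equality invariants are always treated as real.
        // First technique: other invariants count as real only if supported by a check.
        if (kind == Kind.STATE_EQUALITY || supportedByCheck) {
            return Verdict.POTENTIAL_VULNERABILITY;
        }
        return Verdict.SPURIOUS;
    }

    public static void main(String[] args) {
        System.out.println(classify(Kind.GENERAL, true, true));
        System.out.println(classify(Kind.GENERAL, false, true));
        System.out.println(classify(Kind.STATE_EQUALITY, false, true));
    }
}
```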

The vulnerability detection process and our techniques 
to distinguish between spurious and real invariants are 
discussed in more detail in Section 4.3. 


4 Implementation 


We chose to implement the proposed approach for 
servlet-based web applications written in Java. Servlets 
are frequently used for implementing web applications. 
In addition, there are a number of existing tools available 
for Java that can be used for program analysis. In this 
section, we describe the tools that we used, the exten- 
sions that we developed, and the challenges that we had 
to overcome to make them work together. 

We first briefly introduce servlets [24]. A typical
servlet-based web application consists of servlets,
static documents, client-side code, and descriptive
meta-information. A servlet is a Java-based web component
whose methods are executed on the server in response to
certain web requests. Servlets are managed by a servlet
container, which is an extension of a web server that
loads/manages servlets and provides services via a
well-defined API. These services include receiving and
mapping requests to servlets, sending responses, caching,
enforcing security restrictions, etc. Servlets can be
developed as Java classes or as JavaServer Pages (JSPs). JSPs
are a mix of code and static HTML content, and they are
translated into Java classes that implement servlets.


    package myapp;

    public class User {
        private String username;
        private String role;

    }

    public class Order {
        private int tax;
        private int total;
        private Cart cart;

    }

    public class Cart {
        private List products;
        private int total;

    }

    Class Definitions


    _jspService(javax.servlet.http.HttpServletRequest req,
                javax.servlet.http.HttpServletResponse res):::EXIT106

    // invariants for the field "role" belonging to an
    // object stored in the session under the key "user"
    req.session.user.role != null
    req.session.user.role.toString == "admin"

    // invariants for the fields "cart" and "total"
    // stored in the session under the key "order"
    req.session.order.cart.total == req.session.order.total
    req.session.order.total > req.session.order.tax

    Generated Invariants


Figure 1: Example of invariants generated for an exit
point on line 106 of the _jspService method of a servlet.


4.1 Deriving Specifications 


As mentioned previously, in this work, we consider pro- 
gram specifications that can be expressed as invariants 
over program variables. To derive these invariants, we 
leverage Daikon [9, 11], a well-known tool for dynamic 
detection of likely program invariants. 


Daikon. Daikon generates program invariants using ap- 
plication execution traces, which contain values of vari- 
ables at concrete program points. It is capable of gene- 
rating a wide variety of invariants that cover both single 
variables (e.g., total > 50.0) and relationships between 
multiple variables (e.g., total = price * num + tax). 
Daikon-generated invariants are called likely invariants 
because they are based on dynamic execution traces and 
might not hold on all program paths.

Daikon comes with a set of front-ends. Each front-end
is specific to a certain programming language (such
as C or Java). The task of a front-end is to instrument
a given program, execute it, and create data trace files.
These trace files are then fed to Daikon for invariant
generation. For our analysis, we leveraged the existing
front-end for Java, called Chicory, and plugged it into a JVM
on top of which the Tomcat servlet engine [13] is exe- 
cuted. This allowed us to intercept and instrument all 
servlets executed by the Tomcat server. 


The current implementation of Chicory produces 
traces only for procedure entry and exit points and non- 
local variables. Therefore, Daikon generates invariants 
for method parameters, function return values, static and 
instance fields of Java objects, and global variables. 


Our changes. In addition to altering Chicory’s invoca- 
tion model to work with Tomcat, we extended Chicory 
with a way to include the content of user sessions into 
the generated execution traces. Invariants over this data 
are important for the vulnerability analysis of web appli- 
cations because user sessions are an integral part of an 
application’s state and directly affect its logic. 


The content of user sessions is stored by a servlet con- 
tainer in the form of dynamically-generated mappings 
from a key to a value, i.e., as elements in a hash map
container. We found that, given the current design of Daikon
and Chicory, it is not possible to generate useful invari- 
ants for the contents of such containers. The reason is 
that Daikon requires the type and the name of all vari- 
ables that can appear at a particular program point to be 
declared before the first trace for a particular program 
point is generated. This information is not available be- 
forehand for containers like hash maps because they are 
dynamically-sized and can contain elements of different
types.


To generate valid traces for Daikon, Chicory generates
all declarations for program points at application
loading time. At this time, it needs to know the exact
type of each variable/object in a declaration to be able
to traverse the object structure and generate precise (or 
interesting) invariants. For example, in order to gener- 
ate a definition for the field role of the object of type 
User (defined in Figure 1), which might be stored in the 
user session of a servlet application under the key “user,” 
Chicory needs to know that the object of the type User is 
expected in the session. 


To overcome these problems, we provide our front- 
end with possible mappings from a key to an object type 
that can be observed in a session during execution. For 
example, for the code shown in Figure 1, we would need 
to provide the following mappings:

    user:myapp.User
    cart:myapp.Cart
    order:myapp.Order
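A front-end could read such mappings with a few lines of Java. The sketch below assumes one `key:type` pair per line, as in the listing above; the class and method names are hypothetical and do not correspond to Chicory's actual code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SessionTypeMappings {
    /** Parses lines of the form key:fully.qualified.Type into a key-to-type map. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines between entries
            }
            int sep = line.indexOf(':');
            if (sep <= 0 || sep == line.length() - 1) {
                throw new IllegalArgumentException("Expected key:type, got: " + line);
            }
            mappings.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
        }
        return mappings;
    }

    public static void main(String[] args) {
        Map<String, String> m = parse(List.of(
                "user:myapp.User", "cart:myapp.Cart", "order:myapp.Order"));
        System.out.println(m.get("user")); // prints myapp.User
    }
}
```

With the declared type in hand, a front-end can traverse the fields of the session object (for example `role` in `myapp.User`) when it emits declarations at load time.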

We modified Chicory to use this information to gener- 
ate more precise traces for session data. This information 
allows for the generation of more interesting invariants, 
such as the ones shown in Figure 1. We extended the
front-end to generate traces for the content of user
sessions for every method in an application. As future work,
we plan to generate these mappings automatically for
arbitrary containers by generating new declarations as new
elements are found in a container, and then merging the 
resulting traces before feeding them to Daikon. 

To generate program execution traces, we wrote 
scripts to automatically operate web applications. For 
each application, these scripts simulate typical user ac- 
tivities, such as creating user accounts, logging into the 
application, choosing and buying items from a store, ac- 
cessing administrative functionality, etc. The main idea 
of this step is to exercise the application’s common ex- 
ecution paths by following the links and filling out the 
forms presented to the user during a typical interaction 
with the application. The final outcome of the dynamic 
analysis step is a file containing a serialized version of 
likely invariants for the given web application. These 
invariants serve as a (partial, simplified) specification of 
the web application, and they are provided as input to the 
next step of the analysis. 


4.2 Model Checking Applications 


Once the approximate specifications (i.e., the likely in- 
variants) for a web application have been derived, the 
next step is to analyze the application for supporting 
“clues” and identify invariants that are part of a true pro- 
gram specification. Any violation of such an invariant 
represents a vulnerability. 

We chose to use model checking for this step of the 
analysis and implemented it in a tool called Waler (Web 
Application Logic Errors AnalyzeR). Given a servlet- 
based application and a set of likely invariants, Waler 
systematically instantiates and executes symbolically the 
servlets of the application imitating the functionality of 
a servlet container. As the application is executed, Wa- 
ler checks the truth value of provided likely invariants, 
analyzes the application’s code for “clues,” and reports 
possible logic errors. In this section, we describe the
architecture and execution model of Waler. Then, in
Section 4.3, we explain how Waler identifies interesting
invariants and application logic vulnerabilities.


4.2.1 System Top-level Design 


Waler is implemented on top of the Java PathFinder (JPF) 
framework [19, 35], and its general architecture is shown
in Figure 2. In this figure, dark gray boxes represent new
modules that we implemented, while dotted (light gray)
boxes represent parts of JPF that we had to extend.

Figure 2: Waler’s architecture.


JPF overview. JPF is an open-source, explicit-state 
model checker that implements a JVM. It systemati- 
cally explores an application’s state space by executing 
its bytecode. JPF consists of a number of configurable 
components. For example, the specific way in which an 
application’s state space is explored depends on a chosen
Search Strategy; the JPF core distribution includes a
number of basic strategies. The State Serializer component
defines how an application state is stored, matched
against others, and restored. JPF also comes with a num- 
ber of interfaces that allow for its functionality to be ex- 
tended and modified in arbitrary ways. 


In general, JPF is capable of executing any Java class- 
file that does not depend on platform-specific native 
code, and many of the Java standard library classes can 
run on top of JPF unmodified. However, in JPF, some of 
the Java library classes are replaced with their model ver- 
sions to reduce the complexity of their real implementa- 
tions and/or to enable additional features. For example, 
Java classes that have native method calls (such as file 
I/O) have to be replaced by their models, which either 
emulate the required functionality or delegate the native 
calls to the actual JVM on top of which JPF is executed.
Also, JPF comes with a number of extensions that provide
additional functionality on top of JPF. Below, we
discuss the JPF-SE extension for JPF, which we lever- 
aged in Waler to enable symbolic execution. 


The JPF-SE Extension. The JPF-SE extension for JPF 
enables symbolic execution of programs over unbounded 
input when using explicit-state model checking [2]. With 
this extension, the Java bytecode of an application needs 
to be transformed so that all concrete basic types, such 
as integers, floats, and strings, are replaced with the cor- 
responding symbolic types. Similarly, concrete opera- 
tions need to be replaced with the equivalent operations 
on symbolic values. For example, all objects of type int 
are replaced with objects of type Expression. An addition 
of two integers is replaced with a call to the _plus method
of the Expression class. Following the standard symbolic 
execution approach, all newly-generated constraints are 
added to the path condition (PC) over the current execu- 
tion path. The generation of constraints is done in the 
methods of symbolic classes, and it is transparent to the 
application. Whenever the PC is updated, it is checked 
for satisfiability with a constraint solver, and infeasible 
paths are pruned from execution. 
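
To make the transformation concrete, the following minimal Java sketch mimics it under stated assumptions. The names SymInt, PathCondition, and _lt are hypothetical and introduced here only for illustration; the text itself only mentions an Expression type with a _plus method. The satisfiability check performed by the constraint solver is omitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Collects branch constraints along one execution path (illustrative only). */
final class PathCondition {
    private final List<String> constraints = new ArrayList<>();

    void add(String constraint) { constraints.add(constraint); }

    @Override
    public String toString() { return String.join(" && ", constraints); }
}

/** Hypothetical symbolic integer; a stand-in for, not a copy of, JPF-SE's Expression. */
final class SymInt {
    final String text; // printable form of the symbolic value

    private SymInt(String text) { this.text = text; }

    static SymInt symbol(String name) { return new SymInt(name); }

    static SymInt constant(int value) { return new SymInt(Integer.toString(value)); }

    /** Replaces the concrete addition a + b with a symbolic term. */
    SymInt _plus(SymInt other) {
        return new SymInt("(" + text + " + " + other.text + ")");
    }

    /**
     * Replaces the concrete comparison a < b. The branch outcome is supplied by
     * the caller (a model checker would explore both outcomes), and the
     * matching constraint is appended to the path condition.
     */
    boolean _lt(SymInt other, boolean outcome, PathCondition pc) {
        String cmp = text + " < " + other.text;
        pc.add(outcome ? cmp : "!(" + cmp + ")");
        return outcome;
    }
}
```

In this sketch, constraint generation lives entirely inside the symbolic class, which is what keeps it transparent to the transformed application code.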

Unfortunately, we found that JPF-SE was missing a 
considerable amount of functionality that needed to be 
added to make the system suitable for real-world appli- 
cations. For example, the classes implementing symbolic 
string objects were missing a significant number of sym- 
bolic methods with respect to the java.lang.String API, 
which is used extensively in web applications. Also, in 
order to execute an arbitrary application using JPF-SE, 
symbolic versions of many standard Java libraries are re- 
quired. These libraries were not provided with the ex- 
tension. Finally, a tool to perform the necessary transfor- 
mations of Java bytecode was not publicly available, and, 
therefore, we implemented our own transformer by leveraging
ASM [25], a Java bytecode engineering library.

Waler overview. In order to execute servlet-based web
applications and analyze them for logic errors, we had to 
extend JPF in a number of ways. As shown in Figure 2, 
we implemented from scratch four main components: the 
Application Controller (AC), the Vulnerability Analysis 
Strategies (VAS), the Program Checks Analyzer (PCA), 
and the Likely Invariants Analyzer (LIA). The AC com- 
ponent is responsible for loading, mapping, and system- 
atically initiating execution of servlets in a servlet-based 
application. Like the analyzed application itself, it runs on
top of the JVM implemented by core-JPF and uses sym- 
bolic versions of Java libraries. 

The other three components are internal to JPF, i.e.,
they are not visible to web applications and do not rely 
on model classes. The LIA component is responsible for 
parsing Daikon-generated invariants and checking their 
truth value as a program executes. The PCA component
keeps track of all the program checks performed by an 
application on an execution path. Finally, the VAS com- 
ponent provides various strategies for vulnerability de- 
tection based on the information provided by LIA and 
PCA. We provide more details on how these modules 
work in the following sections. 

In addition, we had to extend a number of existing JPF 
components to address the needs of our analysis. In par- 
ticular, we modified existing search strategies, state in- 
formation tracking, and implemented some missing parts 
of JPF-SE. Due to space limitations, we will not explain 
all of the changes unless they are significant for under- 
standing our approach. 

Finally, we extended JPF with a set of 40 model 
classes that provide the servlet API and related inter- 
faces (such as the JSP API). These classes implement the 
standard functionality of a servlet container, but instead 
of reading and writing actual data from/to the network, 
they operate on symbolic values. Our implementation is 
based on the real implementation of the servlet container 
for Tomcat. 


4.2.2 Execution Model 


To systematically analyze a web application for logic er- 
rors, Waler needs to be able to model all possible user in- 
teractions with the application. To achieve that, it needs 
to find all possible entry points to the application and 
execute all the possible sequences of invocations using 
symbolic input. 

In general, a user can interact with a web application 
in different ways: one can either follow the links (leading 
to URLs) presented by the application (as part of a web 
page) or can directly point the browser to a certain URL. 
On the server side, after (and if) a request URL is mapped 
to a servlet-based application, the path part of the URL 
is used to locate a particular servlet that will handle the 
request. We call the set of all such URL paths that lead to 
the invocation of a servlet the “application entry points.” 

Thus, before a program can be analyzed, we need to 
identify all possible application entry points. In the gen- 
eral case, there can be an infinite number of URLs that 
lead to an invocation of a servlet; however, for each par- 
ticular application, there is a finite and well-defined num- 
ber of possible mappings from a request URL pattern to 
a servlet. Thus, for the analysis, it is sufficient to find all 
such mappings. For example, if an application has the 
URL /login mapped to the AuthManager servlet and the
URLs /cart and /checkout mapped to the CartManager
servlet, it can be said that the application has three entry
points. In servlet-based applications, it is also possible
to have wildcard mappings, such as /account/*, mapped
to a servlet. In this case, all URL paths starting with 
/account/ are mapped to the same servlet. We consider
such mappings to represent single entry points and simply
treat the part of the URL that matches the “*” as a
symbolic input. This is consistent with our handling of 
other request parameters accessed by servlets, which are 
also represented by symbolic values. 
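
The mapping rules above can be sketched as follows. This is a simplified illustration, not Waler's code: the class EntryPoints and its methods are hypothetical, the servlet name AccountManager is invented for the example, and only exact and trailing-wildcard patterns are handled (a real servlet container applies additional matching rules).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical table of URL patterns to servlets; each pattern is one entry point. */
final class EntryPoints {
    private final Map<String, String> mappings = new LinkedHashMap<>();

    void map(String urlPattern, String servlet) { mappings.put(urlPattern, servlet); }

    /** Number of entry points: one per mapping, wildcard or not. */
    int count() { return mappings.size(); }

    /** Returns the servlet that handles the given path, or null if none matches. */
    String resolve(String path) {
        String exact = mappings.get(path);
        if (exact != null) {
            return exact;
        }
        for (Map.Entry<String, String> e : mappings.entrySet()) {
            String pattern = e.getKey();
            if (pattern.endsWith("/*")) {
                // Drop the "*" but keep the trailing "/" as the required prefix.
                String prefix = pattern.substring(0, pattern.length() - 1);
                if (path.startsWith(prefix)) {
                    return e.getValue();
                }
            }
        }
        return null;
    }

    /** The part of the path matched by "*"; the analysis would treat it as symbolic input. */
    static String wildcardPart(String pattern, String path) {
        return path.substring(pattern.length() - 1);
    }
}
```

Registering /login, /cart, /checkout, and /account/* yields four entry points in this sketch, regardless of how many concrete URLs begin with /account/.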

To find all entry points, our system inspects the ap- 
plication deployment descriptor (typically, the web.xml 
file), which defines how URLs requested by a user are 
mapped to servlets. When analyzing the URL-to-servlet 
mapping, we take into account that not all servlets are 
directly accessible to users (those servlets that are not 
directly accessible are typically invoked internally by 
other servlets). Following the standard servlet invocation 
model, all URLs that point to accessible (public) servlets 
are assumed to be possible entry points. 

Once the application’s entry points are determined, the 
Application Controller systematically explores the state 
space of the application. To this end, it initiates execu- 
tion of servlets by simulating all possible user choices of 
URLs. For example, if the application has three servlets 
mapped to the URLs /login, /cart, and /checkout, the ap- 
plication controller attempts to execute all possible com- 
binations (sequences) of these servlets. The actual or- 
der in which servlets are explored depends on the chosen 
search strategy. JPF offers a limited depth-first search 
(DFS) and a heuristics-based breadth-first search (BFS) 
strategy. We found that DFS works better for our sys- 
tem because it requires significantly less memory dur- 
ing model checking. With DFS, a path is explored until 
the system reaches a specific (configurable) limit on the 
number of entry points that are executed. 
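
The depth-limited exploration can be pictured with the short sketch below. It only enumerates sequences of entry points up to a configurable limit; the class BoundedDfs is hypothetical, and the real Application Controller additionally executes each servlet symbolically and cuts paths short when equivalent states are found.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical enumeration of entry-point sequences in depth-first order. */
final class BoundedDfs {
    /** Returns every sequence of exactly {@code limit} entry points. */
    static List<List<String>> explore(List<String> entryPoints, int limit) {
        List<List<String>> paths = new ArrayList<>();
        dfs(entryPoints, limit, new ArrayList<>(), paths);
        return paths;
    }

    private static void dfs(List<String> entryPoints, int limit,
                            List<String> current, List<List<String>> out) {
        if (current.size() == limit) {
            out.add(new ArrayList<>(current)); // copy: 'current' is reused while backtracking
            return;
        }
        for (String entryPoint : entryPoints) {
            current.add(entryPoint);               // choose the next servlet to invoke
            dfs(entryPoints, limit, current, out);
            current.remove(current.size() - 1);    // backtrack
        }
    }
}
```

Only the current path is kept in memory at any time, which is consistent with the lower memory use reported for DFS. The number of sequences still grows exponentially with the limit, which motivates the state space reductions discussed next.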


4.2.3 State Space Management 


Similar to other model checkers, Waler faces the state 
explosion problem. Thus, to make Waler scale to real- 
world web applications, we had to take a number of steps 
to manage (limit) the exponential growth of the appli- 
cation’s state space. In particular, after careful analysis 
of several servlet-based applications, we found that JPF 
often fails to identify equivalent states. The two main 
reasons for that are: (1) the constraints added to the sym- 
bolic PC are never removed from it due to the design of 
JPF-SE¹, and (2) without domain-specific knowledge,
JPF is not able to identify “logically equivalent” states. 
Here we present three techniques that we implemented 
to overcome these problems. 


States in JPF. JPF comes with some mechanisms to 
identify equivalent states. A state in JPF is a snapshot 
of the current execution status of a thread, and it con- 
sists of the content of the stack, heap, and static variables 
storage. This snapshot is created when a sequence of executed
instructions reaches a choice point, i.e., a point
where there is more than one way to proceed from the
current instruction. Choice points are thread-scheduling
instructions, branching instructions that operate on sym- 
bolic values, or instructions where a new application en- 
try point needs to be chosen. Whenever JPF finds a 
choice point, a snapshot of the current state is created. 
Then, the serialized version of the state is compared to 
hashes of previously-seen states. The execution path is 
terminated when the same state has been seen before. 

We found that the basic version of JPF performs 
garbage collection and canonicalization of objects on the 
heap before hashing a state. However, it does not per- 
form any additional analysis of memory content when 
comparing states for equality, as JPF has no knowledge 
of the domain-specific semantics of the objects in mem- 
ory. As a result, JPF fails to recognize certain states 
as logically equivalent. This leads to a large number of 
states that are created unnecessarily. We discuss exam- 
ples of some cases in which the standard JPF mechanism 
fails to identify equivalent states below. 


States in Waler. In Waler, we extend the concept of 
JPF state to a “logical state” using the domain-specific
knowledge that Waler has about web applications. In
particular, we observe that the only information that is
preserved between two user requests in a servlet-based 
application are the content of user sessions, application- 
level contexts, the symbolic PC (which stores constraints 
on symbolic variables stored in sessions), and data on 
persistent storage. Since we do not model persistent stor- 
age in Waler and always return a new symbolic value 
when it is accessed, we ignore this information in our 
analysis. Thus, the logical state of a servlet-based application
is defined as the content of user sessions and application
contexts, and the PC. This is the only information
that should be considered when comparing states after 
execution of a user request is finished. 


State space reduction. Given the design of JPF and us- 
ing our concept of logical state, we implemented three 
solutions to reduce the state space of a web application. 

First of all, we implemented an additional analysis 
step to remove a constraint from the PC when it includes 
at least one variable that is no longer live². This is especially
important when the execution of a user request is
finished, because, in a web application, input received by
one servlet is independent from input received by another
servlet, and, unless parts of it are stored in persistent
storage, any constraints on previous input are unrelated
to the new one. The implemented solution is safe (it does
not affect the soundness of the analysis) and allows our 
system to identify many states that are equivalent. 
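
A minimal sketch of this pruning step is shown below. The classes Constraint and PcPruner are hypothetical, constraints are represented as plain text with an explicit variable set, and the liveness information is assumed to be given (computing it is the actual analysis and is not shown).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical path-condition constraint together with the variables it mentions. */
final class Constraint {
    final String text;
    final Set<String> variables;

    Constraint(String text, String... variables) {
        this.text = text;
        this.variables = new HashSet<>(Arrays.asList(variables));
    }
}

final class PcPruner {
    /** Keeps only constraints whose variables are all still live. */
    static List<Constraint> prune(List<Constraint> pathCondition, Set<String> liveVariables) {
        List<Constraint> kept = new ArrayList<>();
        for (Constraint c : pathCondition) {
            // A single dead variable is enough to drop the whole constraint.
            if (liveVariables.containsAll(c.variables)) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

After a request finishes, request-scoped variables are no longer live, so constraints that mention them disappear and two states that differ only in such constraints hash to the same value.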

Left listing: /admin/variables/Add.jsp

 1  public void _jspService(HttpServletRequest req,
 2                          HttpServletResponse res) {
 3
 4    User user = (User) session.getAttribute("User");
 5    if (user == null) {
 6      User.adminLogin(request, response);
 7      return;
 8    }
 9    ...
10    if (request.getMethod().equalsIgnoreCase("post")) {
11      result = website.variables.
12          insert(new Variable(req));
13    }
14  }

Right listing: /admin/variables/index.jsp

 1  public void _jspService(HttpServletRequest req,
 2                          HttpServletResponse res)
 3  {
 4    User user = (User) session.getAttribute("User");
 5    if (user == null || (!user.isAdmin())) {
 6      User.adminLogin(request, response);
 7      return;
 8    }
 9    ...
10    out.println("<a href=\"admin/variables/\
11        Add.jsp\">Add New</a>");

Figure 3: Simplified version of an unauthorized access vulnerability in the JspCart application.

The second solution to reduce an application’s state
space is to prune many “irrelevant” paths from state
exploration. Consider, for example, an /error servlet,
which simply displays an error message, or a /products
servlet, which displays a list of available products. Executing
such servlets often results in changes to the state
of the memory, for example, due to different Java classes 
that must be loaded. However, once such a servlet is ex- 
ecuted, the application is still in the same logical state. 
Also, the state after executing, for example, the servlet 
/login will be logically equivalent to the state resulting 
from the execution of the sequence of servlets [/error, 
/login]. From this observation, it is clear that it would be 
beneficial to identify servlets whose executions do not 
modify the logical state of the application. The reason 
is that there is no need to consider them for vulnerability
analysis. Therefore, after a servlet is executed, we
analyze the content of the application’s memory to determine
whether the application’s logical state has been
changed (for example, because of changes to the content 
of the user session). When no changes are detected, the 
exploration of the current execution path is terminated. 
This modification also does not compromise the soundness
of the analysis, assuming that the memory analysis
takes into account all the components of the application’s
logical state.
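
The comparison of logical states can be sketched as follows. LogicalState and PathPruner are hypothetical names, session and context contents are simplified to string maps, and the PC is simplified to a set of constraint strings; the point is only that equality is defined over these three components and nothing else in memory.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical logical state: session content, application context, and path condition. */
final class LogicalState {
    final Map<String, String> session;
    final Map<String, String> context;
    final Set<String> pathCondition;

    LogicalState(Map<String, String> session, Map<String, String> context,
                 Set<String> pathCondition) {
        // Defensive copies so that later changes do not alter a recorded snapshot.
        this.session = new HashMap<>(session);
        this.context = new HashMap<>(context);
        this.pathCondition = new HashSet<>(pathCondition);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LogicalState)) {
            return false;
        }
        LogicalState other = (LogicalState) o;
        return session.equals(other.session)
            && context.equals(other.context)
            && pathCondition.equals(other.pathCondition);
    }

    @Override
    public int hashCode() { return Objects.hash(session, context, pathCondition); }
}

final class PathPruner {
    /** True if the servlet left the logical state unchanged, so the path can be terminated. */
    static boolean canStop(LogicalState before, LogicalState after) {
        return before.equals(after);
    }
}
```

In this sketch, executing an /error servlet that touches neither session nor context leaves the snapshot equal to its predecessor, whereas a /login servlet that stores a User attribute does not.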


A third technique to limit the state space explosion 
problem is to identify irrelevant entry points, so that the 
servlets mapped to these URLs do not need to be ex- 
ecuted. More precisely, during model checking, when 
our analysis determines that a servlet neither reads
from nor writes to the application’s logical state at all, the
execution of this page can be ignored for all other exe- 
cution paths. The pruning of irrelevant servlets is espe- 
cially helpful in large applications, where the execution 
of a servlet over symbolic inputs can take several min- 
utes (and thus, can result in days of model checking time 
if the servlet is executed on multiple paths). 


To summarize, the state explosion problem that can
arise in the model checking of web applications can be
significantly mitigated in many cases. In particular, we
developed the following three techniques to limit the 
growth of an application’s state space: we improved the 
existing JPF state hashing algorithm to disregard a path
condition when its variables are out of scope, we found
a way to prune the exploration of irrelevant paths, and 
we identify irrelevant servlets and discard them from our 
vulnerability analysis. We found that these techniques 
often allow for a significant reduction in the number of 
states explored by Waler. For example, running Waler on 
the Jebbo-2 application (described in Section 5) without 
using any of our state reduction techniques resulted in 
the execution of 322,637 states, and it took around 223 
minutes to terminate. When the same application was 
executed using our three heuristics, Waler terminated in 
about a minute and needed to explore only 529 states to 
obtain the same result. 


4.3 Vulnerability Detection


As described in the previous section, Waler uses model 
checking to systematically explore the state space of an 
application. During the model checking process, the sys- 
tem checks whether the likely invariants generated by 
Daikon for a program point hold whenever that point is 
reached. In our current implementation, we only con- 
sider likely invariants that are generated for exit points 
of methods (note that we differentiate between different 
exit points). The reason is that methods often check their 
parameters inside the function body (rather than in the 
caller). As a result, entry invariants are typically less sig- 
nificant. 

To see an example of invariants that can be produced 
by our system, consider the code in Figure 3, which 
shows a vulnerability that Waler found in the JspCart application
(see Section 5). The left listing shows the code
of the /admin/variables/Add.jsp servlet, which is a privi- 
leged servlet that should only be invoked by an adminis- 
trator. This is reflected by the set of likely invariants that 
are generated for the exit point on Line 14 for Add.jsp³:


(1) session.User != null 
(2) session.User.isAdmin == true 
(3) session.User.txtUsername == "admin@jspcart.com" 


It can be seen that the first two invariants are part of 
the “true” program specification, while the third invariant
is spurious (an artifact of the limited test coverage). As
a side note, the invariant for the exit point on Line 7 of
Add.jsp would be session.User == null.

To help us to determine whether a likely invariant 
holds or fails on a path, we implemented the Program 
Checks Analyzer module that keeps information about 
all the checks performed on an execution path. When 
a comparison instruction is executed, the PCA records 
the names of the variables involved and the result of the 
comparison. Also, the PCA keeps track of all variable 
assignments in the program. As a result, whenever the 
PCA encounters a check that operates on local variables, 
it can determine how this check constrains (affects) non- 
local variables. Recall that Daikon does not generate in- 
variants for local variables, and, therefore, we are not in- 
terested in comparisons over local variables unless they 
store session data or method parameters. 

Consider now what happens when Waler analyzes the 
Add.jsp servlet. After Waler executes the if-statement on 
Line 5, information about a new check is added to the 
set of current constraints accumulated by the PCA. If the 
user is authenticated, the value stored in the session 
object under the key User is not null. In this case, 
the PCA adds session.User != null to the set of
checks along the current execution path, and the execution
proceeds at Line 9⁴. Otherwise, the PCA records the
fact session.User == null, and execution proceeds
at Line 6.

Once Line 14 of Add.jsp is reached, Waler checks
whether all likely invariants generated for this point hold. 
A likely invariant holds on the current path if we can 
determine that the relationship among the involved vari- 
ables is true. An invariant fails otherwise. To determine 
whether a likely invariant holds, we check whether the 
truth of this invariant can be determined directly given 
the current application state (i.e., the invariant involves
concrete values). If not, we check whether the set of 
constraints accumulated on the current path implies the 
relationship defined by the invariant using the constraint 
solver employed by JPF-SE.
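
The two-stage check can be outlined as in the sketch below. It is a deliberately crude stand-in: InvariantChecker is a hypothetical class, invariants and recorded checks are plain strings, and entailment is approximated by set membership, whereas the approach described above asks a constraint solver whether the accumulated constraints imply the invariant.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical two-stage invariant check: concrete evaluation first, then path facts. */
final class InvariantChecker {
    enum Verdict { HOLDS, FAILS }

    /**
     * @param invariant textual invariant, e.g. "session.User != null"
     * @param concrete  invariants whose truth value is already known from concrete state
     * @param pathFacts checks recorded along the current execution path
     */
    static Verdict check(String invariant, Map<String, Boolean> concrete,
                         Set<String> pathFacts) {
        Boolean known = concrete.get(invariant);
        if (known != null) {
            // Stage 1: the invariant involves concrete values only.
            return known ? Verdict.HOLDS : Verdict.FAILS;
        }
        // Stage 2 (assumption): syntactic membership replaces the solver query.
        return pathFacts.contains(invariant) ? Verdict.HOLDS : Verdict.FAILS;
    }
}
```

Note that an invariant that cannot be shown to hold is treated as failing, which matches the definition given above.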

Following the example, it can be seen that the first in- 
variant for Line 14 always holds (because of the check on 
Line 5), while the other two might fail on some paths. In 
principle, we could immediately report the violations of 
the last two invariants as a potential program flaw. How- 
ever, this would raise too many false positives, due to 
spurious invariants. In the following sections, we intro- 
duce two techniques to identify those invariants that are 
relevant to the detection of web application logic flaws. 


4.3.1 Supported Invariants 


The first technique to identify real invariants is based on 
the insight that many vulnerabilities are due to developer
oversights. That is, a developer introduces checks that 
enforce the correct behavior on most program paths, but 
misses an unexpected case where the correct behavior 
can be violated. 


To capture this intuition, we defined a technique that 
keeps track of which paths contain checks that support an 
invariant and which paths are lacking such checks. More 
precisely, an execution path on which a likely invariant 
holds and is supported by a set of checks on that path
is added to the set of supporting paths for this invariant. 
That is, along a supporting path, the program contains 
checks that ensure that an invariant is true. A path on 
which a likely invariant can fail is added to the set of 
violating paths. When a likely invariant holds on all pro- 
gram paths to a given program point, then we know that 
it holds for all executions and there is no bug. When all 
paths can possibly violate a likely invariant, then we as- 
sume that the programmer did not intend this invariant 
to be part of the actual program specification, and it is 
likely an artifact of the limited test coverage. An appli- 
cation logic error is only reported by Waler if at least one 
supporting path and at least one violating path are found 
for an invariant at a program point. 
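
The reporting rule reduces to a three-way classification, sketched below. The names Classification and SupportedInvariants are hypothetical, and the path counts are assumed to have been collected during model checking.

```java
/** Possible outcomes for one likely invariant at one program point. */
enum Classification { NO_BUG, SPURIOUS, LOGIC_ERROR }

final class SupportedInvariants {
    /**
     * @param supportingPaths paths on which the invariant holds and is backed by checks
     * @param violatingPaths  paths on which the invariant can fail
     */
    static Classification classify(int supportingPaths, int violatingPaths) {
        if (violatingPaths == 0) {
            return Classification.NO_BUG;      // holds on every path
        }
        if (supportingPaths == 0) {
            return Classification.SPURIOUS;    // never enforced: likely a test-coverage artifact
        }
        return Classification.LOGIC_ERROR;     // enforced on some paths, missed on others
    }
}
```

Only the last case, with at least one supporting and at least one violating path, produces a report.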

Let us revisit the example of Figure 3. Waler determines
that the first invariant on Line 14 of Add.jsp
always holds. The third one is never supported, and, 
thus, it is correctly discarded as spurious. Moreover, 
Waler finds a violating path for the second invariant 
(session.User.isAdmin == true) by calling the 
Add.jsp servlet with a user in a non-administrative role.
However, the system also inspects the path where index.jsp
is called first, which reflects the normal, intended
flow of the application. This servlet, shown on
the right of Figure 3, contains a check on Line 5 that 
adds the fact session.User.isAdmin == true to 
the PC (assuming that the user is authenticated as an 
administrator). In this case, when Add.jsp is invoked 
after index.jsp, the system determines that the invari- 
ant session.User.isAdmin == true holds and is 
supported. Thus, Waler finds a supporting path for this 
invariant. As a result, the fact that one can execute the 
main method of Add.jsp directly, violating its exit invariant
session.User.isAdmin == true, is correctly
recognized as an unauthorized access vulnerability. 

We found that checking for supported invariants works 
well in practice. However, it can produce false posi- 
tives and is not capable of capturing all possible logic 
flaws. The main source of false positives stems from the 
problem that the violation of an invariant, even when it 
is supported by a program check on some paths, does 
not necessarily result in a security vulnerability. For example,
access to a normally protected page does not always
result in a vulnerability because either (1) a sensitive
operation performed by the page fails if a set of preconditions,
uncontrolled by an attacker, is not satisfied,
or (2) there is no sensitive operation on the path executed
during the access. Reasoning about these cases is extremely
hard for any automated tool. However, we found
that such false positives often indicate non-security bugs
in the code, and, thus, they are still useful for a developer.
This technique also fails to identify logic vulnerabilities
when the programmer does not introduce any checks for
a security-relevant invariant at all. In such cases, Waler
incorrectly concludes that an invariant is spurious because
it cannot find any support in the code. To address
this limitation, we introduce an additional technique in
the following section.

    public void _jspService(HttpServletRequest req,
                            HttpServletResponse res) {
      if (req.getMethod() == "GET") {
        ...
        out.println("<form method=post"
            + " action=\"edituser.jsp\">");
        out.println("<input type=hidden"
            + " name=\"username\" value="
            + session.getAttribute("username") + ">");
        ...
        out.println("</form>");
      }
      if (req.getMethod() == "POST") {
        stmt = conn.prepareStatement("UPDATE users SET"
            + " password = ?, name = ? WHERE username = ?");
        stmt.setString(1, req.getParameter("password"));
        stmt.setString(2, req.getParameter("name"));
        stmt.setString(3, req.getParameter("username"));
        stmt.executeUpdate();
        ...
      }

    edituser.jsp

Figure 4: Simplified user profile editing vulnerability
(Jebbo-6).

    public void doPost(HttpServletRequest req,
                       HttpServletResponse res) {
      ...
      sess = request.getSession(true);
      if (action.equals("/editpost")) {
        s = conn.prepareStatement("UPDATE posts SET"
            + " author = ?, title = ?, entry = ?"
            + " WHERE id = ?");
        s.setString(1, (String) sess.getAttribute("auth"));
        s.setString(2, req.getParameter("title"));
        s.setString(3, req.getParameter("entry"));
        s.setString(4, req.getParameter("id"));
        s.executeUpdate();

    PostController.java

Figure 5: Simplified post editing vulnerability (Jebbo-5).


4.3.2 Internal Consistency 


As mentioned previously, Waler will discard invariants 
as spurious when they are not supported by at least one 
check along a program path. This can lead to missed
vulnerabilities when the invariant is actually security-relevant.
To address this problem, we leverage general
domain knowledge about web applications and identify 
a class of invariants that we always consider significant, 
regardless of the presence of checks in the program. 

We consider a likely invariant to be significant when 
it relates data stored in the user session with data that 
is used to query a database. Capturing this type of relationship
is important because the user session
object and the database are the primary mechanisms to
store (persistent) information related to the logical state 
of the application. Moreover, we do not allow any arbi- 
trary relationships: instead, we require that the invariant 
be an equality relationship. Such relationships are rarely 
coincidental because, by design, session objects and the 
database often replicate the same data. 

Whenever Waler finds a path through the application 
that violates a significant invariant, it reports a logic 
vulnerability. To implement this technique, the system 
needed to be extended in two ways. First, we instru- 
mented database queries so that the variables used in cre- 
ating SQL queries are captured by Daikon and included 
in the invariant generation process. To this end, for
each SQL query in the web application, we introduced a 
“dummy” function. The parameters of each function rep- 
resent the variables used in the corresponding database 
query, and the function body is empty. The purpose of 
introducing this function is to force Daikon to consider 
the parameters for invariant generation at the function’s 
exit point. Second, we require a mechanism to iden- 
tify significant invariants. This was done in a straight- 
forward fashion by inspecting equality invariants for the 
presence of variables that are related to the session object 
and database queries. 
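
Both extensions can be illustrated with a short sketch. All names here are hypothetical: dbQueryUpdateUser stands for one generated “dummy” function, and the prefixes session. and db_query. are assumed naming conventions modeled on the invariant shown later for Figure 4. Daikon itself is not invoked.

```java
/** Hypothetical empty probe inserted next to one SQL query. */
final class QueryProbes {
    /**
     * The body is intentionally empty. The function exists only so that an
     * invariant detector observes the query parameters at its exit point.
     */
    static void dbQueryUpdateUser(String password, String name, String username) {
    }
}

/** Hypothetical filter that selects "significant" invariants. */
final class SignificantInvariants {
    /** True only for equalities that relate a session variable to a query variable. */
    static boolean isSignificant(String left, String operator, String right) {
        if (!"==".equals(operator)) {
            return false; // arbitrary relationships are not accepted
        }
        return (isSession(left) && isQuery(right))
            || (isQuery(left) && isSession(right));
    }

    private static boolean isSession(String variable) {
        return variable.startsWith("session.");
    }

    private static boolean isQuery(String variable) {
        return variable.startsWith("db_query.");
    }
}
```

The probe would be called immediately before the corresponding query is executed, with the same values that are bound to the query.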

To see how the internal consistency technique can be 
used to identify a vulnerability, consider the code shown 
in Figure 4. This figure shows a snippet of code taken 
from the edituser.jsp servlet in one of the Jebbo applications
(see Section 5)⁵. The purpose of this servlet is to
allow users to edit and update their profiles. When the 
user invokes the servlet with a GET request, the applica- 
tion outputs a form, pre-filled with the user’s current in- 
formation. As part of this form, the application includes 
the user’s name in the hidden field username, which is
retrieved from the session object (shown in the upper half 
of Figure 4). When the user has finished updating her in- 
formation, the form is submitted to the same servlet via a 
POST request. When this request is received, the appli- 
cation extracts the name of the user from the username 
parameter and performs a database query (lower half of 
Figure 4). 

For this servlet, the dynamic analysis step (Daikon)
generates the invariant session.username ==
db_query.parameter3, which expresses the fact
that a user can only update her own profile. Unfortunately,
it is possible that a malicious client tampers with
the hidden field username before submitting the form. 
In this case, the profile of an arbitrary user can be mod- 
ified. Waler detects this vulnerability because it deter- 
mines that there exists a path in the program where the 
aforementioned invariant is violated (as the parameter 
username is not checked by the code that handles the 
POST request). Since this invariant is considered signif- 
icant, a logic flaw is reported. 

The idea of checking the consistency of parameters to 
database queries can be further extended to also take into 
account the fields of the database that are affected by a 
query, but that do not appear explicitly in the query’s pa- 
rameters. Consider, for example, a message board ap- 
plication that allows users to update their own entries. 
It is possible that the corresponding database query uses 
only the identifier of the message entry to perform the 
update. However, when looking at the rows that are af- 
fected by legitimate updates, one can see that the name of 
the owner of a posting is always identical to the user who 
performs the update. To capture such consistency invari- 
ants, we extended the parameters of the “dummy” func- 
tion to not only consider the inputs to the database query 
but to also include the values of all database fields that 
the query affects (before the query is executed). When 
multiple database rows are affected, the “dummy” func- 
tion is invoked for each row, allowing Daikon to capture 
aggregated values of fields. 
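
The per-row extension might look like the following sketch. AffectedRowProbe and its members are hypothetical, rows are simplified to string maps with an assumed author column, and the equality test at the end merely imitates the kind of invariant that a detector such as Daikon could infer from the recorded observations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical probe that records one observation per affected database row. */
final class AffectedRowProbe {
    /** What an invariant detector would see at the probe's exit point. */
    static final class Observation {
        final String sessionAuth;  // session data of the user performing the update
        final String queryId;      // explicit query parameter
        final String postsAuthor;  // field of the affected row, read before the update

        Observation(String sessionAuth, String queryId, String postsAuthor) {
            this.sessionAuth = sessionAuth;
            this.queryId = queryId;
            this.postsAuthor = postsAuthor;
        }
    }

    /** Invokes the probe once for each row that the query would affect. */
    static List<Observation> observe(String sessionAuth, String queryId,
                                     List<Map<String, String>> affectedRows) {
        List<Observation> seen = new ArrayList<>();
        for (Map<String, String> row : affectedRows) {
            seen.add(new Observation(sessionAuth, queryId, row.get("author")));
        }
        return seen;
    }

    /** Checks the candidate invariant postsAuthor == sessionAuth over all observations. */
    static boolean authorMatchesSession(List<Observation> observations) {
        for (Observation o : observations) {
            if (!o.sessionAuth.equals(o.postsAuthor)) {
                return false;
            }
        }
        return true;
    }
}
```

During legitimate use, every observation satisfies the equality, so it appears as a likely invariant even though the author field is not a query parameter.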

By extending the “dummy” function as outlined pre- 
viously, Daikon can directly generate invariants that in- 
clude fields stored in the database, even when these fields 
are not directly specified in the query parameters. Again, 
we consider invariants as significant if they introduce an 
equality relationship between database contents and ses- 
sion variables. The intuition is that these invariants im- 
ply a constraint on the database contents that can be ac- 
cessed/modified by the query. If it was possible to violate 
such invariants, an attacker could modify records of the 
database that should not be affected by the query. 


For example, this allows us to detect vulnerabilities 
where an attacker can modify the messages of other users 
in the Jebbo application. Consider the doPost func- 
tion shown in Figure 5. The problem is that an au- 
thenticated user is able to edit the message of any other 
user by simply providing the application with a valid 
message id. During the dynamic analysis, the invariant
db.posts_author == session.auth is generated,
even though the posts_author field is not used
as part of the update query. During model checking, we 
determine that this invariant can be violated (and report 
an alert) because there is no check on the id parameter 
that would enforce that only the messages written by the 
current user can be modified. 




4.3.3 Vulnerability Reporting 


For each detected bug, Waler generates a vulnerability 
report. This report contains the likely invariant that was
violated, the program point to which this invariant belongs,
and the path on which the invariant was violated
(given as a sequence of servlets and corresponding meth- 
ods that were invoked). This information makes it quite 
easy for a developer or analyst to verify vulnerabilities. 
Currently, vulnerabilities are simply grouped by program
points. Given the low number of false positives, this allows
for an effective analysis of all reports. However, not
every alert generated by Waler currently maps directly
to a vulnerability or a false positive. We found several
situations where multiple invariant violations referred to
the same vulnerability (or the same false positive) in
application code. For example, Waler generated several alerts
when (conceptually) the same invariant was violated at
different program points or when two distinct invariants
referred to the same application concept. Finding better
techniques to aggregate and triage reports in
such situations is an interesting topic of research, which 
we plan to investigate in the future. 
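
The report contents and the grouping step described above could be represented roughly as follows. This is a sketch only: the Report type, its field names, and the sample values are assumptions made for illustration and do not reflect Waler's actual output format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    invariant: str      # the likely invariant that was violated
    program_point: str  # the program point to which the invariant belongs
    path: tuple         # sequence of (servlet, method) pairs that were invoked


def group_by_program_point(reports):
    """Group reports so that each program point is reviewed once."""
    groups = defaultdict(list)
    for report in reports:
        groups[report.program_point].append(report)
    return dict(groups)


# Hypothetical sample reports: two violations at the same program point.
reports = [
    Report("session.auth != null", "EditPost.doPost",
           (("EditPost", "doPost"),)),
    Report("db.posts_author == session.auth", "EditPost.doPost",
           (("Login", "doPost"), ("EditPost", "doPost"))),
    Report("session.isAdmin == true", "Admin.doGet",
           (("Login", "doPost"), ("Admin", "doGet"))),
]
groups = group_by_program_point(reports)
```

Grouping by program point alone leaves the two EditPost reports as separate entries in one group; merging conceptually identical alerts would need additional logic.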


4.3.4 Limitations 


Our approach aims at detecting logic vulnerabilities in 
a general, application-independent way. However, the 
current prototype version of Waler has a number of lim- 
itations, many of which, we believe, can be solved with 
more engineering. First, the types of vulnerabilities that 
can be identified by Waler are limited by the set of 
currently implemented heuristics. For example, if an
application allows the user to include a negative number of
items in the shopping cart, we would be able to identify
this issue only if the developer checked for that number
to be non-negative on at least one program path leading
to that program point. In addition, this check needs to be
a direct if-comparison⁶ between variables. Conditions
deriving from switch instructions or resulting from complex
operations (such as regular expression matching) are
not currently supported.

Another limitation stems from the fact that we need a 
tool to derive approximations of program specifications. 
As a result, the detection rate of Waler is bounded by the 
capabilities of such a tool. In the current implementation, 
we chose to use Daikon. While Daikon is able to derive a
wide variety of complex relationships between program
variables, it has limited support for some complex data
structures. For example, if the isAdmin flag value is
stored in a hash table, and it is not passed as an argument
to any application function, Daikon will not be able to
generate invariants based on that value. This limitation
could be mitigated by implementing a smarter exploration
technique for complex objects and/or by tracing
local and temporary variables for the purpose of likely
invariant generation. However, care needs to be exercised
in this case to avoid an explosion in the number of
invariants generated.

Another issue that we faced when working with Dai- 
kon was scalability: in its current implementation, Dai- 
kon creates a huge data structure in main memory when 
processing an execution trace. As a result, using Daikon 
on a larger application requires a large amount of RAM. 
We worked around this limitation by partitioning the ap- 
plication into subsets of classes and by performing the 
likely invariant generation on each subset separately. 
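
The workaround can be sketched as a simple batching loop. The text does not say how the subsets were chosen, so the fixed-size split below is an assumption, and run_detector is a hypothetical callable standing in for one invariant-generation run over a subset of classes.

```python
def partition(classes, max_size):
    """Split the class list into consecutive subsets of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [classes[i:i + max_size] for i in range(0, len(classes), max_size)]


def generate_invariants_in_batches(classes, max_size, run_detector):
    """Run the detector once per subset and concatenate the results."""
    invariants = []
    for subset in partition(classes, max_size):
        # Only one subset's trace data has to be held in memory at a time.
        invariants.extend(run_detector(subset))
    return invariants


# Toy usage with a fake detector that emits one "invariant" per class.
def fake_detector(subset):
    return [f"{name}: this != null" for name in subset]


result = generate_invariants_in_batches(
    ["Login", "Profile", "EditPost", "Rate", "Admin"], 2, fake_detector
)
```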

A more important limitation of Daikon is that invariants
generated by the tool cannot capture all possible relations.
For example, the invariants currently supported by Daikon
do not directly capture temporal relations such
as “operation A has to precede operation B.” To address
these limitations, different “intended behavior” capturing 
tools (such as [1]) could be employed by Waler in the 
first step of the analysis, although we leave this research 
direction for future work. 

Another, more general, limitation of the first step of 
our analysis is the fact that we need to exercise the ap- 
plication in a “normal” way (i.e., not deviating from the 
developer’s intended behavior). This part cannot be fully 
automated and needs human assistance. Nevertheless,
many tools exist to ease the task of recording and scripting
user browsing activity, such as Selenium [31].

Finally, the state explosion problem is one of the main 
limitations of the chosen model checking approach. We 
have already described several heuristics that help Waler
limit the state space of an application, and we are
currently working on implementing a combination of concrete
and symbolic execution techniques to further improve
scalability.


5 Evaluation 


We evaluated the effectiveness of our system in detecting
logic vulnerabilities on twelve applications: four real-world
applications (namely, Easy JSP Forum, JspCart⁷,
GIMS, and JaCoB), which we downloaded from the SourceForge
repository [28], and eight servlet-based applications
written by senior-level undergraduate students as
part of a class project, named Jebbo. When choosing
the applications, we looked for ones that could
potentially contain interesting logic vulnerabilities, were
small enough to scale with the current prototype of Waler,
and did not use any additional frameworks (such as
Struts or Faces). While we show that it is possible to
scale Waler to real-world applications, its scalability is
still a work in progress, as it is based on two tools, JPF
and Daikon, that were not designed to work on large
applications.

All chosen applications were analyzed following the
techniques introduced in Section 4. During the model
checking phase, we explored paths up to a depth of 6 (that
is, the limit for the depth-first search of JPF was set to 6).
Note that all vulnerabilities reported below were found at
a depth of three or less; we then doubled the search depth
to let Waler check for deeper bugs. All tests were performed
on a PC with a Pentium 4 CPU (3.6 GHz) and 2
gigabytes of RAM.
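
The effect of a depth limit on a depth-first search can be illustrated with the generic sketch below. It is not JPF's search implementation; the function name, the explicit stack, and the toy successor graph are assumptions chosen to keep the example small.

```python
def bounded_dfs(start, successors, depth_limit):
    """Depth-first exploration that never follows a path longer than depth_limit.

    Returns a dict mapping each reached state to the smallest depth at which
    it was reached. A state is re-expanded if it is later reached at a
    shallower depth, so that states near the limit are not missed.
    """
    best_depth = {}
    stack = [(start, 0)]
    while stack:
        state, depth = stack.pop()
        if state in best_depth and best_depth[state] <= depth:
            continue  # already explored from an equal or shallower depth
        best_depth[state] = depth
        if depth < depth_limit:
            for nxt in successors(state):
                stack.append((nxt, depth + 1))
    return best_depth


# Toy state graph: "c" is reachable both directly and through "b".
graph = {"a": ["c", "b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}
reached = bounded_dfs("a", lambda s: graph[s], depth_limit=2)
```

With a limit of 2, state "e" (three steps from "a") is never reached, which mirrors how bugs deeper than the configured limit would go unexplored.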

The results of our analysis are shown in Table 1. Wa- 
ler found 29 previously-unknown vulnerabilities in four 
real-world applications and 18 previously-unknown vul- 
nerabilities in eight Jebbo applications. It also produced 
a low number of false positives. In Table 1, the columns 
Lines of Code and Bytecode Instructions show the size of 
the applications in terms of the number of lines of Java 
code (JSP pages were first compiled into their servlet 
representations) and of the number of bytecode instruc- 
tions, respectively. The column Entry Points shows how 
many entry points were found and analyzed by Waler and 
the column States Explored shows how many states were 
covered. The columns Likely Invariants and Invariants
Violated respectively show how many invariants were
generated by Daikon and how many of them were reported
as violated by Waler. The numbers in the column
Alerts represent the (manual) aggregation of the reported
invariant violations (as discussed in Section 4.3.3).
The columns Vulnerabilities, Bugs, and False Positives
show the aggregated number of vulnerabilities,
security-unrelated bugs, and false alarms that were produced by
Waler. Note that the numbers in these columns are based
on the analysis of the aggregated alerts. Finally, the
column Running Time shows the time required for the
analysis.


5.1 Vulnerabilities 


Easy JSP Forum: The first application that we ana- 
lyzed is the Easy JSP Forum application, a community 
forum written in JSP. Using Waler, we found that any
authenticated user can edit or delete any post in a forum.
To enforce access control, the Forum application
does not show a “delete” or “edit” link for a post if the
current user does not have moderator privileges for the
current forum, but it fails to check these privileges when
a delete or an edit request is received. Thus, if a user
forges a delete/edit request to the application using a 
valid post id (all ids can be obtained from the source 
code of web pages accessible by all users), a post will 
be deleted/modified. 

Application    | Lines of Code | Bytecode Instructions | Entry Points | States Explored | Likely Invariants | Invariants Violated | Alerts | Vulnerabilities | Bugs | False Positives | Runtime (min)
Easy JSP Forum | 2,416         | 7,348                 | 2            | [?]             | 38                | 6                   | 3      | [?]             | [?]  | [?]             | [?]
GIMS           | 6,153         | 11,269                | [?]          | 36,228          | [?]               | [?]                 | [?]    | [?]             | [?]  | [?]             | [?]
JaCoB          | 8,924         | 15,129                | 38           | 26,809          | [?]               | 0                   | 0      | 0               | 0    | 0               | [?]

Table 1: Experimental results.


GIMS: The second application that we analyzed is the
Global Internship Management System (GIMS) web application,
a human resource management system. Using Waler, we found
that many of the pages in the application
do not have sufficient protection from unauthorized
access. In particular, our tool correctly identified
14 servlets that can be accessed by an unauthenticated
user (a user that is not logged in at all). Most of these
pages do contain a check that ensures that there is some
user data in a session (which is only true for authenticated
users). When the check fails, the application generates
output that redirects the client’s browser to a login
page. Unfortunately, at this point, the application does
not stop processing the request because of a missing return
statement. Moreover, we found that certain pages in the
GIMS application that should only be accessible to users
with administrative privileges do not have checks to confirm
the role of the current user. As a result, nine administrative
pages were correctly reported as vulnerable.
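
The missing-return pattern described above can be reproduced in a few lines. The sketch below is a language-neutral illustration written in Python rather than the application's servlet code; the Response class, the handler names, and the page content are hypothetical.

```python
class Response:
    """Minimal stand-in for an HTTP response object."""

    def __init__(self):
        self.redirect_to = None
        self.body = []

    def send_redirect(self, url):
        # Recording a redirect does not stop the handler from running.
        self.redirect_to = url

    def write(self, text):
        self.body.append(text)


def vulnerable_handler(session, response):
    if "user" not in session:
        response.send_redirect("/login")
        # BUG: no return here, so execution falls through to the code below.
    response.write("internship records")


def fixed_handler(session, response):
    if "user" not in session:
        response.send_redirect("/login")
        return  # stop processing the request for unauthenticated users
    response.write("internship records")


unauthenticated = {}
r_bad, r_good = Response(), Response()
vulnerable_handler(unauthenticated, r_bad)
fixed_handler(unauthenticated, r_good)
```

Both handlers issue the redirect, but only the vulnerable one still executes the protected code for a client without session data.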


JaCoB: The third application is JaCoB, a community 
message board application that supports posting and 
viewing of messages by registered users. For this program,
our tool found no vulnerabilities and generated
no false alerts. However, closer analysis of
the application revealed two security flaws, which could 
not be identified with the techniques used by Waler. For 
example, when a user registers with the message board or 
logs in, she is expected to provide a username and a pass- 
word. Unfortunately, when this information is processed 
by the application, the password is simply ignored. Also, 
in this application, a list of all its users and their private 
information is publicly available. These two problems 
represent serious security issues; however, they cannot 
be detected by Waler because the program specification 
that can be inferred from the application’s behavior does 
not contain any discrepancies with respect to the appli- 
cation’s code. 


JspCart: The fourth test application is JspCart, which is 
a typical online store. Waler identified a number of pages 
in its administrative interface that can be accessed by 
unauthorized users. In JspCart, all pages available to an 
admin user are protected by checks that examine the User 
object in the session. More precisely, the application verifies
that a user is authenticated and that the user has
administrative privileges. However, Waler found that four
out of 45 pages are missing the second check. Therefore,
any user that has a regular account with the store can ac- 
cess administrative pages and add, modify, or delete set- 
tings (e.g., the processing charge for purchases). A sim- 
plified version of one of these vulnerabilities is shown 
in Figure 3. Waler also found a logic vulnerability that 
allows an authenticated user to edit the personal informa- 
tion of another user by submitting a valid email address 
of an existing user. This vulnerability is similar to the 
one shown in Figure 4. 


Jebbo: We analyzed a set of eight Jebbo applications that 
were written by senior-level undergraduate students as a 
class project. Jebbo is a message board application that 
allows its users to open accounts, post public messages, 
and update their own messages and personal information. 
Some of the applications also implement a message rat- 
ing functionality. For this project, all students were pro- 
vided with a description of the application to implement 
along with a set of rules (including security constraints) 
that were expected to be enforced by the application. 


After running Waler on this set of applications, we 
found that six out of eight applications contained one or 
more logic flaws. Examples of the vulnerabilities found 
by Waler include the fact that unauthenticated users can
post a message to the board, and the lack of authorization
checks when users rate an existing message (e.g., to
prevent a user from rating her own messages). Ironically,
most of the students followed the provided specification
carefully and checked that access to certain
pages is limited to authenticated users only; however, due
to various mistakes, the enforcing checks were not always
sufficient. For example, common problems that we
found are missing return statements on an error path and 
a failure to foresee all possible paths available to a user 
to access a certain functionality. 


Waler identified a number of application logic flaws
that are associated with unauthorized data modification,
such as the possibility to edit personal information or
posts belonging to another user. Examples of
these vulnerabilities are shown in Figure 4 and Figure 5.
These vulnerabilities are classic examples of inconsis- 
tent usage of data by the application. It is interesting to 
observe that even though the students were aware of pos- 
sible parameter tampering vulnerabilities, and, in many 
cases, they were very careful about checking user input 
for validity, they often failed to apply this knowledge to 
cases where there were multiple paths to the same pro- 
gram point. 

The results for the Jebbo application demonstrate that 
logic flaws are hard to avoid, even in simple web appli- 
cations. Almost all applications in this set were found 
to be vulnerable despite the fact that the students were 
given a clear program specification and knew basic web 
security practices. Given the class level of the students 
who were enrolled in the class, it is reasonable to assume 
that their programming skills are not far off from those 
of entry-level programmers. This, together with the fact 
that the complexity of real-world applications is much 
higher than the complexity of the Jebbo application, can 
be seen as an indication of how widespread web application
logic flaws are. Moreover, it can be argued that
many real-world applications are, at least partially, written
by students who are widely employed year-round as
interns.


5.2 Discussion 


As shown in Table 1, Waler generated a low number
of false positives. Careful analysis of the alerts that did
not represent a vulnerability revealed that the majority
of them represent true weaknesses in code. These alerts
were classified as bugs. We found that these bugs were
either potential vulnerabilities that turned out to be
unexploitable in particular situations or were not interesting
for exploitation. For example, an unauthenticated user
might be able to access a certain page, but this page
does not contain any sensitive information. We classified
the rest of the alerts as false positives.


We also carefully analyzed the applications for false 
negatives. We found that Waler missed some security 
problems, like the ones in JaCoB, but we consider these 
vulnerabilities to be out of scope as they cannot be de- 
tected using our approach. We also identified several 
cases where Waler missed vulnerabilities that should be 
detectable using the described approach. The main rea- 
son for such false negatives is the incomplete modeling 
of all application features in the current version of Waler. 
For example, Waler only identifies program checks in the
form of if-statements, but in real applications, checks can
be implemented using, for instance, database queries and
regular expressions. Precise modeling of such constructs
is left for future work.

Another way to evaluate the false-negative rate of
Waler would be to run it on an application that has some
known logic vulnerabilities. Unfortunately, we found a
very limited number of such applications to be available, 
and none of them met all of our current selection criteria 
for test applications. 


6 Related Work 


Our work is related to several areas of active research, 
such as deriving application specifications, using specifi- 
cations for bug finding, and vulnerability analysis of web 
applications. However, due to the limited space avail- 
able, in this section we will only highlight the research 
that, in our opinion, is most related. 

First, our approach is related to a number of ap- 
proaches that combine dynamically-generated invariants 
with static analysis. For example, Nimmer and Ernst ex- 
plore how to integrate dynamic detection of program in- 
variants and their static verification on a set of simple 
stand-alone applications using Daikon and the ESC/Java 
static checker [27]. The invariants that are verified by 
the static checker on all paths are determined to be the 
real invariants for an application, and the invariants that 
could not be statically verified are shown as warnings 
to the user. The main goal of this research is to show 
the feasibility of the proposed approach rather than to 
find bugs. Another work that explores the benefits of combining
Daikon-generated invariants with static analysis is
the DSD-Crasher tool by Csallner and Smaragdakis [8].
The main goal of this system is to decrease the false-positive
rate of a static bug-finding tool for stand-alone Java
applications. Dynamically-generated invariants are used
by the CnC tool (also based on ESC/Java) as assumptions
on method arguments and return values to narrow
the domain searched by the static analyzer. In Waler, in
contrast to both approaches, we do not assume that the 
invariants generated by Daikon are correct, and we only 
consider them to be clues for vulnerability analysis. In- 
troducing our two additional techniques to differentiate 
between real and spurious invariants allows us to avoid 
many of the false positives due to limitations of the dy- 
namic analysis step. 

Our work is also related to the research on using an 
application’s code to infer application-specific properties 
that can be used for guided bug finding. To the best of our 
knowledge, one of the first techniques that uses inferred 
specifications to search for application-specific errors is 
the work by Engler et al. [10]. Their goal is similar to 
ours in the sense that both works try to identify violations
of likely invariants in applications. The way this is
achieved, though, is very different in the two approaches.
While we infer specifications from dynamic analysis and
check for possible violations in the code via symbolic
execution, Engler’s work carries out all the steps via static
analysis: a set of given templates is used to extract a set
of “beliefs” from the code. Afterward, patterns contradicting
these “beliefs” are identified in the code. While
some of the templates may be useful for web applications,
most of the bugs they try to identify relate to
kernel code and memory-unsafe programming language
operations. Moreover, we believe that having an additional
source of information (i.e., dynamic traces) for applica- 
tion invariants makes our system more robust. 


There is also recent work that uses statistical analysis 
and program code to learn certain properties of the appli- 
cation, with the goal of searching for application-specific 
bugs. For example, Kremenek et al. propose a statistical 
approach, based on factor graphs, to automatically infer 
which program functions return or claim ownership of 
a resource [21]. The AutoISES tool applies the idea of 
using statically-inferred specifications to the detection of 
vulnerabilities in the implementations of access control 
mechanisms for OS-level code [34]. The differences between
these approaches and ours are similar to those
with Engler’s work. Both approaches use statistical
analysis to find violations of properties that must hold 
for all program points, and they do not require reasoning 
about the values of variables. 


Learning invariants through dynamic analysis has already
found application for security purposes, mostly in order
to train an Intrusion Detection System. Baliga et
al. [4] employ Daikon to extract invariants on kernel
structures from periodic memory snapshots of a
non-compromised running system. After the training phase,
these learned invariants are used to detect the presence of 
kernel rootkits that may have altered vital kernel struc- 
tures. A conceptually similar approach has also been ap- 
plied by Bond et al. [6] to Java code through instrumen- 
tation of the Java Virtual Machine. An initial learning 
phase is employed to record the calling context and call 
history for security-sensitive functions. Afterwards, the 
collected information is used to identify function invoca- 
tions with an anomalous context. An anomalous context 
or history is considered an indicator of an attempt to di- 
vert the intended flow of the application, possibly by the 
exploitation of a logic error in the code. In that case, an 
alert is triggered or the execution is aborted. 

Although both the techniques proposed by Baliga and
Bond share with ours an initial dynamic learning phase,
how the information is leveraged differs. For example,
unlike the two approaches above, we do not assume that
the likely invariants generated by the first phase are real
invariants; rather, we simply use them as hints for further
analysis. In addition, while in our second phase we try to
identify logic errors in the code by means of static analysis,
they instead try to detect attacks being performed
on a live system. Such run-time detection imposes an
overhead, which results in the requirement for dedicated 
hardware for [4] and a 2%-9% penalty in performance 
for [6]. The authors of the latter work, in particular, 
traded some coverage of the code (limiting to security- 
related functions) in order to retain acceptable perfor- 
mance. Even though they focused on logic errors, a di- 
rect comparison with their evaluation environment was 
not possible, because of the different targets of the anal- 
ysis. More precisely, they looked for bugs in the Java 
libraries triggered by Java applets, rather than bugs in 
Java-based web applications. 

Another direction of research deals with protection of 
web service components against malicious and/or com- 
promised clients. Guha et al. [15] employ static anal- 
ysis on JavaScript client code in order to extract an ex- 
pected client behavior as seen by the server. The server is 
then protected by a proxy that filters possibly malicious 
clients which do not conform to the extracted behavior. 

Finally, our work is related to a large corpus of work, 
such as [16, 5, 7, 17, 18, 22, 26, 30, 33, 36, 23, 29], in the 
area of vulnerability analysis of web applications. How- 
ever, most of these research works focus on the detec- 
tion of or the protection against input-validation attacks, 
which do not require any knowledge of application- 
specific rules. 

Among the approaches cited above, Swaddler [7] and 
MiMoSA [5] are tools developed by our group that look 
for workflow violation attacks in PHP-based web applications,
using a number of different techniques (including
Daikon-generated invariants). However, Waler’s approach
is more general and is able to identify any kind of
policy violation that is either reflected by a check in the
application or that violates a consistency constraint.

Our work is also related to the QED tool presented 
in [23]. QED uses concrete model checking (with a set 
of predefined concrete inputs) to identify taint-based vul- 
nerabilities in servlet-based applications. The main sim- 
ilarity between the two tools is that they both use a set 
of heuristics to limit an application’s state space during 
model checking. Heuristics used by QED, however, are 
more specific to the taint-propagation problem and re- 
quire an additional analysis step. 


7 Conclusions 


In this paper, we have presented a novel approach to the 
identification of a class of application logic vulnerabil- 
ities, in the context of web applications. Our approach 
uses a composition of dynamic analysis and symbolic 
model checking to identify invariants that are a part of the 
“intended” program specification, but are not enforced 
on all paths in the code of a web application.

We implemented the proposed approaches in a tool,
called Waler, that analyzes servlet-based web applica- 
tions. We used Waler to identify a number of previously- 
unknown application logic vulnerabilities in several real- 
world applications and in a number of senior undergrad- 
uate projects. 

To the best of our knowledge, Waler is the first tool 
that is able to automatically detect complex web appli- 
cation logic flaws without the need for a substantial hu- 
man (annotation) effort or the use of ad hoc, manually- 
specified heuristics. 

Future work will focus on extending the class of ap- 
plication logic vulnerabilities that we can identify. In ad- 
dition, we plan to extend Waler to deal with a number of 
frameworks, such as Struts and Faces. This will require 
creating “symbolic” versions of the libraries included in 
these frameworks. This initial development effort will 
allow us to apply our tool to a much larger set of web ap- 
plications, since most large-scale, servlet-based web ap- 
plications rely on one of these popular frameworks, and 
the lack of their support in Waler was a serious limit- 
ing factor when choosing real-world applications for the 
evaluation described in this paper.


Notes


¹ As a consequence of that, JPF includes constraints that are no
longer relevant to the current execution in the application’s state,
preventing it from detecting otherwise equivalent states.

² Note that by using the simple strategy of removing all constraints
that reference no longer live variables, we might potentially lose some
of the implied constraints in the PC. This can reduce the effectiveness
of the reduction of the state space, but it does not interfere with the
soundness of the analysis.

³ The names of the variables are generated as explained in
Section 4.1.

⁴ When session data is accessed on a path, the PCA records that
fact, along with the key that was used. This is done by storing the
item session.<key> in an attribute of the memory location that holds
the reference to the object. The information is then propagated by JPF
with each bytecode instruction that accesses this memory location.

⁵ A similar vulnerability was found by Waler in the JspCart
application. We use Jebbo as a simpler example.

⁶ Note that our tool works on Java bytecode rather than source code.
Therefore, loop exit conditions are implicitly included, as they are
implemented in terms of IF opcodes.

⁷ The code for the JspCart application is located in the SourceForge
repository under the name B2B eCommerce Project.


Abstract


Maintaining correct access control to shared resources 
such as file servers, wikis, and databases is an important 
part of enterprise network management. A combination 
of many factors, including high rates of churn in organi- 
zational roles, policy changes, and dynamic information- 
sharing scenarios, can trigger frequent updates to user 
permissions, leading to potential inconsistencies. With 
Baaz, we present a distributed system that monitors up- 
dates to access control metadata, analyzes this informa- 
tion to alert administrators about potential security and 
accessibility issues, and recommends suitable changes. 
Baaz detects misconfigurations that manifest as small in- 
consistencies in user permissions that are different from 
what their peers are entitled to, and prevents integrity and 
confidentiality vulnerabilities that could lead to insider 
attacks. In a deployment of our system on an organiza- 
tional file server that stored confidential data, we found 
10 high-level security issues that impacted 1639 out of
105682 directories. These were promptly rectified. 


1 Introduction 


In present-day enterprise networks, shared resources 
such as file servers, web-based services such as wikis, 
and federated computing resources are becoming in- 
creasingly prevalent. Managing such shared resources 
requires not only timely availability of data, but also cor- 
rect enforcement of enterprise security policies. 

Ideally, all access should be managed through a per- 
fectly engineered role-based access control (RBAC) sys- 
tem. Individuals in an organization should have well- 
defined and precise roles, and access control to all re- 
sources should be based purely on these roles. When 
a user changes her role, her access rights to all shared 
resources should automatically change according to the 
new role with immediate effect. 

In reality though, several organizations use disjoint ac- 
cess control mechanisms which are not kept consistent. 
Often, access is granted to individual users rather than to 
appropriate roles. To make matters worse, administrators 
and resource owners manually provide and revoke access 
on an as-needed and sometimes ad-hoc basis. As access 
requirements and rights of individuals in the enterprise 
change over time, it is widely recognized [19, 12, 5] that
maintaining consistent permissions to shared resources
in compliance with organizational policy is a significant
operational challenge.

Incorrect access permissions, or access control mis- 
configurations, can lead to both security and accessibility 
issues. Security misconfigurations arise when a user who 
should not have access to a certain resource according to 
organizational policy, does indeed have access. According
to a recent report [12], 50 to 90% of the employees in
4 large financial organizations had permissions in excess
of what their organizational role entitled them to, opening
a window of opportunity for insider attacks that can lead
to disclosure of confidential information for profit, data
theft, or data integrity violations. The 2007 PricewaterhouseCoopers
survey on the global state of information
security found that 69% of database breaches were by
insiders [24]. On the other hand, accessibility misconfig- 
urations arise when a user who should legitimately have 
access to an object, does not. Such misconfigurations, in 
addition to being annoyances, impact user productivity. 

Security and accessibility misconfigurations occur due 
to several reasons. One contributing factor is the high 
rate of churn in organizations, and in organizational roles 
among existing employees, which necessitate changes 
in access permissions. In the same report [12], it was 
estimated that in one business group of 3000 people, 
1000 organizational changes were observed over a period
of a few months. Another factor is the dynamic nature
of information sharing workflows, where employees
work together across organizational groups on short-term
collaborations. When permissions are granted to
shared resources for such collaborations, they are rarely 
revoked. In longer time-scales, organizations also update 
their policies in response to changing protection needs. 
Very often, these policies are not explicitly written down 
and system administrators, who have an operational view 
of security, may not have a global view of organizational 
needs, and may not be able to make these changes in a 
timely manner. 

To make matters worse, very often, no complete high- 
level manifests exist, which correctly assign access per- 
missions for a resource according to organizational pol- 
icy. Consequently, given the large numbers of shared re- 
sources, different access control mechanisms and enter- 
prise churn, it is difficult for administrators to manually 
manage access control.

To address these limitations of existing access control 
management systems, we present Baaz, a system that 
monitors access control metadata of various shared re- 
sources across an enterprise, finds security and acces- 
sibility misconfigurations using fast and efficient algo- 
rithms, and suggests suitable changes. 

To our knowledge, Baaz is the first system that helps 
an administrator audit access control mechanisms and 
discover critical security and accessibility vulnerabilities 
in access control without using a high-level policy mani- 
fest. To do this, Baaz uses two novel algorithms: Group 
Mapping, which correlates two different access control 
or group membership datasets to find discrepancies, and 
Object Clustering, which uses statistical techniques to 
find slight differences in access control between users in 
the same dataset. 

We do not claim that techniques we use in Baaz will 
find all misconfigurations, as the notion of policy itself is 
not defined in most of our deployment settings. Also, 
given that access permissions change very organically 
over time and several of these changes are linked to ad- 
hoc and one-off access requirements, it is very difficult 
for an automated system to deduce the exact and com- 
plete list of all misconfigurations. However, our deploy- 
ment experiences with real datasets have shown Baaz to 
be very effective at flagging high-value security and ac- 
cessibility misconfigurations. 

The operational context and main characteristics of 
Baaz are: 


e No assumption of well-defined policy: Baaz does 
not require a high-level policy manifest, though it 
can exploit one if it exists. Rather than checking for 
“correct” access control, it checks for “consistent”
access control by comparing users’ access permis- 
sions and memberships across different resources. 


e Proactive vs Reactive: Baaz takes as input static 
permissions, such as access control lists, rather than 
access logs. This approach helps fix misconfigura- 
tions before they can be exploited, reducing chances 
of insider attacks. However, the system can be eas- 
ily augmented to process access logs if required. 


e Timeliness: Baaz continuously monitors access 
control, so it can be configured to detect and report 
misconfigurations on sensitive data items as they 
occur, or just present periodic reports for less sensi- 
tive data. 


We present results from Baaz deployments on three 
heterogeneous resources across two organizations. We
interacted with system administrators of both organiza- 
tions to validate the reports and found a number of high- 
value security and accessibility misconfigurations, some 
of which were fixed immediately by the respective system
administrators. In all these organizations, no policy
manifest was readily available. Before we deployed
Baaz, these administrators had to examine thousands of 
individual or group permissions to validate whether these 
permissions were intended. The utility of Baaz can be 
gauged to some extent from some comments we received 
from administrators: 


“This report is very useful. I didn’t even know 
these folks had access!” 


“This output tells me how many issues there 
are. Now I HAVE to figure out what to do in 
the future to handle access control better.” 


“I did not realize that our policy change had
not been implemented!” 


Our Baaz deployment in one organization found 10 se- 
curity and 8 accessibility misconfigurations in confiden- 
tial data stored on a shared file server. The security mis- 
configurations were providing 7 users unwarranted ac- 
cess to 1639 directories. 

The rest of the paper is organized as follows: Section 2 
describes our problem scope and assumptions. Section 3 
presents the system architecture of Baaz, as well as an 
overview of our algorithm workflow. Section 4 explains 
our Matrix Reduction procedure for generating summary 
statements and reference groups, followed by Sections 5 
and 6, in which we present our Group Mapping and Ob- 
ject Clustering algorithms. In Section 7, we outline more 
detailed issues we encountered while designing the sys- 
tem, and in Section 8, we describe our implementation, 
deployment and evaluation of the Baaz prototype. Re- 
lated work is presented in Section 9, and Section 10 sum- 
marizes the paper. 


2 System Assumptions 


The main goal of Baaz is to find misconfigurations in ac- 
cess control permissions (as in ACLs) typically caused 
by inadvertent misconfigurations, which are difficult for 
an administrator to detect and rectify manually. We 
do not detect misconfigurations of access permissions 
caused by manipulation by active adversaries. We assume
that the inputs to our tool, such as the ACLs and
well-known user groups, have not been tampered with. In many
organizations, only administrators or resource owners will
be able to view and modify these metadata in the first 
place, so this assumption is reasonable. 

In our target environment, a definition of correct policy
is not explicitly available. Therefore, rather than
checking for correct access control, which we believe is 
difficult, the system checks for consistent access control. 
Essentially, Baaz finds relatively small inconsistencies in 
Figure 1: Baaz System Architecture


user permissions by comparing different sets of access 
control lists, or by comparing user permissions within 
the same access control list. We assume that large differ- 
ences in access control are not indicative of misconfig- 
urations. Clearly, our definition of small inconsistencies 
and large differences (provided in Sections 5 and 6) will 
govern the set of misconfigurations we find. It is possible
that this may lead to the system missing some genuine
problems, which is an inherent limitation. In fact, as
described in Section 8.2, our deployment of Baaz missed 
detecting some valid misconfigurations. However, ad- 
ministrators can tune these parameters to keep the output 
concise and useful. 


3 System Overview 


In this section, we present an overview of the system 
components of Baaz. At the heart of our system, as 
shown in Figure 1, is a central server that collects ac- 
cess permission and membership change events from dis- 
tributed stubs attached to shared resources. This server 
runs the misconfiguration detection algorithm when it re- 
ceives these change events, and generates a report. An 
administrator/resource owner can decide whether each
misconfiguration tuple that Baaz reports is valid, invalid,
or an intentional exception. Administrators/owners will 
need to fix the valid misconfigurations manually. We 
now provide an overview of the client stubs and server 
functions. 


3.1 Baaz Client Stubs 


Baaz stubs continuously monitor access control permis- 
sions on shared resources such as file servers, wikis, 
version-control systems, and databases, and they monitor 
updates to memberships in departmental groups, email 
lists, etc. Each stub translates the access permissions for 
a shared resource into a binary relation matrix, an example
of which is shown in Figure 2. Each such matrix
captures relations specific to the resource that the stub 
runs on. For example, a file server stub captures the user- 
file access relationship, relating which users can access 
given files. On a database that stores organizational hi- 
erarchy, the Baaz stubs capture the user-group member- 
ship relation, relating which users are members of given 
groups. We shall refer to an element in the relation matrix
$M$ as $M_{i,j}$. A “1” in the $i$-th row and the $j$-th column
of $M$ indicates that the relation holds between the entity at
row $i$ and the entity at column $j$, e.g., user $i$ can read file
$j$, or user $i$ belongs to group $j$, whereas a “0” indicates
that the relation does not hold.
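
As an illustration of the relation matrix just described, the following minimal Python sketch converts a small access control list into a binary matrix. The file names, user names, and the helper `build_relation_matrix` are hypothetical and chosen only for this example; they are not part of Baaz.

```python
# Hypothetical ACL data: object name -> set of users who may access it.
acl = {
    "budget.xls": {"alice", "bob"},
    "roadmap.doc": {"alice", "carol"},
    "notes.txt": {"bob"},
}


def build_relation_matrix(acl):
    """Return (users, objects, M) with M[i][j] == 1 iff users[i] can access objects[j]."""
    users = sorted({u for allowed in acl.values() for u in allowed})
    objects = sorted(acl)
    matrix = [[1 if u in acl[o] else 0 for o in objects] for u in users]
    return users, objects, matrix


users, objects, M = build_relation_matrix(acl)
for user, row in zip(users, M):
    print(user, row)
```

Rows correspond to users and columns to objects, matching the $M_{i,j}$ convention above. A group-membership stub would produce the same shape of matrix with groups in place of objects.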

Each Baaz stub sends $M_{i,j}$ to the Baaz server either
periodically, or in response to a change in the relation- 
ship. Section 7.2 further describes various issues that 
we need to consider while designing and implementing 
stubs. 


3.2 Baaz Server


At initial setup, an administrator registers pairs of sub- 
ject datasets and reference datasets with the server, 
which form inputs to the server’s misconfiguration detec- 
tion algorithm. The subject dataset is the access control 
dataset which an administrator wants to inspect for mis- 
configurations. A reference dataset is a separate access 
control or group membership dataset that Baaz treats as 
a baseline against which it compares the subject. In a 
sense, one can view the subject dataset as the implemen- 
tation, and the reference dataset as an approximate pol- 
icy, and the process of misconfiguration detection com- 
pares the implementation with the approximate policy. 
Figure 2 shows an example subject dataset relation 
matrix of ten users (labeled as A to J) and 16 objects 
(labeled as 1 to 16), and Figure 3 shows an example ref- 
erence dataset relation matrix of the same set of users 
Figure 2: Example subject dataset’s relation matrix


and 4 groups (labeled as W to Z). We will use these ex- 
ample inputs to illustrate our misconfiguration detection 
algorithm. 


Administrators can register multiple subject-reference 
pairs with the server, and each pair is processed inde- 
pendently, with the server periodically generating one 
misconfiguration report for each. If any changes are de- 
tected in matrices corresponding to a registered subject- 
reference pair, the server runs the misconfiguration de- 
tection algorithm, which has three steps: 


Matrix Reduction: In the first step, the server re- 
duces the subject and reference datasets’ relation matri- 
ces to summary statements that capture sets of users that 
have similar access permissions and group memberships. 
Each summary statement can be thought of as a high- 
level statement of policy intent, gleaned entirely from the 
low-level relation matrices. We explain this procedure in 
Section 4. 


Group Mapping: In this step, our goal is to uncover 
access permissions in the subject dataset that seem in- 
consistent with patterns in the reference dataset. Con- 
sider an example where the subject is a file server, and 
a reference is a list of departmental groups, as shown in 
Figure 1. Say a directory hierarchy on the file server can 
be accessed by all members in the human resources de- 
partment in an organization, and by only one member of 
the facilities department. This has a high likelihood of 
being a security misconfiguration. Section 5 explains 
this procedure. 


Object Clustering: Finally, in the Object Clustering 
phase, Baaz finds potential inconsistencies in the subject 
dataset by comparing summary statements for the subject
that are “similar”, but not the same. The main idea is
that a user whose access permissions differ only slightly 
from that of a larger set of users could potentially be a 
misconfiguration. For example, if 10 users in the subject 
dataset can access a given set of 100 files, but say an 11th 
user can access only 99 of these files, Baaz flags a candi- 
date accessibility misconfiguration. We describe this in 
Section 6. 


The system reports security candidates as “A user set 
Figure 3: Example reference dataset’s relation matrix


U MAY NOT need access to object set O”. Accessibility
candidates are of the form “A user set U MAY need access
to object set O”. At this point, the administrator will
need to identify reported misconfiguration candidates as
“valid”, “invalid”, or “intentional exceptions”, which are
defined as follows.


Valid: The misconfiguration candidate is correct, and the 
administrator needs to make the recommended changes. 


Invalid: The misconfiguration candidate is incorrect, 
and the administrator should not make the recommended 
changes. 


Intentional Exception: The administrator should not 
make the recommended changes, but the candidate pro- 
vides useful information to the administrator. 


The intentional exception category captures all re- 
ported misconfigurations that correspond to exceptions 
which appear out of the ordinary but are legitimate. Ad- 
ministrators found these exceptions to be useful as they 
help check compliance and may, over time, become valid 
misconfigurations. An example of an intentional excep- 
tion is a user who has just changed roles. To help with 
the transition, he still has access to some documents re- 
lated to his previous role. Hence while his access should 
not be revoked at the current time, it should probably be 
in the near future. 


The server archives candidates marked as invalid, and 
does not explicitly display them in future reports. The re- 
ports will, however, display intentional exceptions. Sec- 
tion 7.1 describes more specific issues related to server 
design and evaluation. 
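
The triage workflow above can be pictured with a small data model. This is only a sketch under assumptions: the names `Verdict`, `Candidate`, and `visible_in_report` are hypothetical, and the verdicts assigned in the example are invented for illustration rather than taken from any deployment.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    UNREVIEWED = "unreviewed"
    VALID = "valid"
    INVALID = "invalid"
    INTENTIONAL_EXCEPTION = "intentional exception"


@dataclass(frozen=True)
class Candidate:
    kind: str              # "security" or "accessibility"
    users: frozenset
    objects: frozenset
    verdict: Verdict = Verdict.UNREVIEWED


def visible_in_report(candidates):
    """Hide candidates archived as invalid; keep intentional exceptions."""
    return [c for c in candidates if c.verdict is not Verdict.INVALID]


# Hypothetical verdicts, for illustration only.
candidates = [
    Candidate("security", frozenset("D"), frozenset({9, 10, 11, 12}), Verdict.VALID),
    Candidate("accessibility", frozenset("HJ"), frozenset({6, 7}), Verdict.INVALID),
    Candidate("security", frozenset("DI"), frozenset({13}), Verdict.INTENTIONAL_EXCEPTION),
]
shown = visible_in_report(candidates)
```

Keeping intentional exceptions in the visible list mirrors the behaviour described above, where such entries remain useful for compliance checks.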


One of the important properties of our algorithms is 
that the misconfiguration candidates converge to a steady 
state. That is, if we run our Group Mapping and Ob- 
ject Clustering algorithms repeatedly starting from a 
given raw configuration, and if we resolve our miscon- 
figurations as suggested, we will eventually (and fairly 
quickly) reach a state where no new candidates appear. 
This guarantee is what we call internal consistency. We 
will illustrate this through our examples in Sections 4 and 
Subject Dataset Summary Statements

1. {C, D} → {15, 16}
2. {C, D, E, F, G} → {6, 7}
3. {A, B, C, D} → {9, 10, 11, 12}
4. {A, B, C, D, I} → {13}
5. {C, D, E, F, G, H} → {1, 2, 3, 4, 5}


Figure 4: The result of the matrix reduction step on our 
example subject dataset’s matrix. 


5. The detailed proof is available on our webpage¹. In
the next three sections, we describe the server algorithm 
in detail. 


4 Matrix Reduction 


We apply the matrix reduction procedure on the rela- 
tion matrices of both the subject and reference datasets. 
The goal of this step, in the context of the subject 
dataset, is to find summary statements relating sets of 
users (user-sets) that can access the same sets of ob- 
jects (object-sets). Given a relation matrix, different 
kinds of summaries can be generated. Role mining al- 
gorithms [22, 25, 18, 28, 10], for example, try to find 
minimal overlapping sets of users and objects that have 
common permissions. In contrast, we find user-sets that 
have access to disjoint object-sets, as required by our 
misconfiguration detection algorithms. For the reference 
dataset, we find group membership summaries in a simi- 
lar manner. 


4.1 Subject Dataset 


Our algorithm takes the relation matrix for the subject 
dataset as input, and examines each column, grouping 
together all objects that have identical column vectors. 
Essentially, it groups all objects that are accessible to an 
identical set of users. 

Figure 4 shows the summary statements that our Ma- 
trix Reduction algorithm finds for the example shown 
earlier in Figure 2. Each greyscale coloring within the 
matrix represents a distinct summary statement. The list 
of summary statements that our algorithm yields is also 
shown in the figure. The first statement arises from users 
C and D having identical access rights, since they both


¹ http://research.microsoft.com/baaz
[Figure content: Reference Dataset Summary Statements]

Figure 5: The result of the matrix reduction step on our
example reference dataset’s matrix.


have access to objects 15 and 16, and to no other object. 
We therefore interpret this in the following way: Users 
C and D have exclusive access to objects 15 and 16, i.e.,
no other user has access to these objects. 

The Baaz server finds all such summary statements to 
completely capture the matrix. Next, it explicitly filters 
out all summary statements that involve only one user 
since our algorithm only looks for misconfigurations in- 
volving objects that are shared between more than one 
user. Figure 6 presents this algorithm in detail.

Complexity: Since the algorithm simply involves one
sweep through the subject’s relation matrix, grouping to- 
gether identical columns, it runs in O(nm) time, where n 
is the number of users in the matrix and m is the number 
of objects. 


EXTRACT SUMMARY STATEMENTS
Input: M {binary relation matrix of all users 𝒰 and all objects 𝒪}
Output: S {set of summary statements [U_k → O_k]}
Uses: H {hashtable, indexed by sets of users, stores sets of objects}

 1: S = ∅, H = ∅
 2: for all o ∈ 𝒪 do
 3:     U = Get User Set(M, o) // gets the set of users who can access o
 4:     if H.contains(U) then
 5:         O_U = H.get(U)
 6:         H.put(U, O_U ∪ {o})
 7:     else
 8:         H.put(U, {o})
 9:     end if
10: end for
11: for all U_k ∈ H.keys do
12:     O_k = H.get(U_k)
13:     S = S ∪ {[U_k → O_k]}
14: end for
15: return S

Figure 6: Algorithm to extract summary statements 
given the users and the access control matrix 
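
The procedure in Figure 6 can be sketched in a few lines of Python. This is an illustrative sketch, not the Baaz implementation: the function name and the dictionary-based input format are assumptions. The example columns are chosen to be consistent with the summary statements in Figure 4; objects 8 and 14 are assumed to be single-user objects, since the full matrix of Figure 2 is not reproduced in the text.

```python
from collections import defaultdict


def extract_summary_statements(access):
    """Group objects by the exact set of users that can access them.

    `access` maps each object to the users who can access it (one column
    of the relation matrix per entry). Returns {user_set: object_set}.
    """
    by_user_set = defaultdict(set)      # plays the role of hashtable H
    for obj, users in access.items():
        by_user_set[frozenset(users)].add(obj)
    # Drop statements with a single user, as described in Section 4.1.
    return {u: objs for u, objs in by_user_set.items() if len(u) > 1}


access = {}
for objs, users in [
    ((15, 16), "CD"),
    ((6, 7), "CDEFG"),
    ((9, 10, 11, 12), "ABCD"),
    ((13,), "ABCDI"),
    ((1, 2, 3, 4, 5), "CDEFGH"),
    ((8,), "J"),    # assumption: object 8 has a single user
    ((14,), "I"),   # assumption: object 14 has a single user
]:
    for o in objs:
        access[o] = set(users)

statements = extract_summary_statements(access)
for user_set, object_set in statements.items():
    print(sorted(user_set), "->", sorted(object_set))
```

Using a `frozenset` of users as the dictionary key gives the single sweep over columns that the complexity argument relies on, and it makes the resulting object-sets disjoint by construction.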
Figure 7: The result of the Group Mapping algorithm on the example subject matrix.


4.2 Reference Dataset 


We apply the same process on the matrix for the refer- 
ence dataset. The summary statements that our algo- 
rithm finds for the reference dataset relation matrix are 
shown in Figure 5. We call the user-set in each summary 
statement obtained from the reference dataset a reference 
group. The reference groups for our example are: 

$G_1 = \{C, D, E, F, G, H, J\}$

$G_2 = \{A, B, C\}$

$G_3 = \{C, D\}$

The objects W, X, Y, Z are merely used to find the reference
groups, and are not used by future phases of our
algorithm.


5 Group Mapping 

In this section, we describe the Group Mapping algorithm,
which takes as input the user-sets representing the
subject dataset and the reference groups discovered from
the reference dataset, and finds the best mapping from
each user-set to the reference groups. The server uses
these maps to flag outliers (users) as misconfiguration 
candidates. We first explain why Group Mapping is a 
useful step in finding misconfigurations. Next, we ex- 
plain how Group Mapping works on our example data, 
and then we present the algorithm in detail. 


5.1 Algorithm 


Now we describe the Group Mapping algorithm in more 
detail. Table 1 summarizes the list of symbols and variables
we use here, and in the description of the Object
Clustering algorithm. 


5.2 Intuition and Definitions 


The Group Mapping algorithm for finding misconfigura- 
tions relies on the following two assumptions: 


1. Users in the same reference group should have the same
access permissions. 


2. Given a set of reference groups that have the same 
access permissions, any user who is not a member 
of these reference groups should not have the same 
access permissions as users within these reference 
groups. 


Based on these two assumptions, we define misconfig- 
uration candidates for the algorithm to find as follows: 
Accessibility (based on Assumption 1): If a majority
of the members of a reference group all have ac- 
cess to a set of objects, and a minority do not have 
access to the same set of objects, then we flag the 
users without access as accessibility misconfigura- 
tion candidates. 


Security (based on Assumption 2): Of all users in a 
user-set, if a majority of them form one or more ref- 
erence groups, and a minority of users do not form 
any reference groups, we flag the minority of users 
as security misconfiguration candidates. 


Following these definitions, the first thing to do is to 
find a mapping from user-sets to reference groups. How- 
ever, since we are looking for outliers, we do not restrict 
the algorithm to finding an exact and complete mapping. 
Our goal is to find the “best-effort” mapping from user- 
sets to reference groups. In this process, some users in 
the user-sets may not map to any reference group, or a 
user-set may map to a reference group that has some ex- 
traneous users, who are not part of the user-set. 

To illustrate with our running example, our Group 
Mapping algorithm maps the five user-sets in the sum- 
mary statements we found in Figure 4 to the reference 
groups found in Section 4.2 as shown in Figure 7.
For the user-set of summary statement 1, the mapping is
exact. For the user-set for statement 2, the best map is
$G_1$, which covers all users but also includes users H and
J who are not in the user-set. For the user-set in summary
statement 4, the best map is $G_2$, while users D and
I remain unmapped. 

From this mapping, using the assumptions and defini- 
tions stated above, we infer the following misconfigura- 
tion candidates: 


1. From summary statement 2, users H and J MAY 
need access to objects 6, 7. 


2. From summary statement 3, user D MAY NOT need 
access to objects 9, 10, 11, and 12. 


3. From summary statement 4, users D and I MAY 
NOT need access to object 13. 


4. From summary statement 5, user J MAY need access 
to objects 1, 2, 3, 4, and 5. 
Symbol        Description
l             number of summary statements from subject dataset
g             number of reference groups from reference dataset
[U_i → O_i]   i-th summary statement for subject, with U_i being the user-set and O_i being the object-set
G_j           j-th reference group
C_i           set of groups used to cover user-set U_i
T_i           list of uncovered users in user-set U_i after covering it by C_i
ΔG_j          list of users in G_j but not in user-set U_i, where G_j ∈ C_i

Table 1: Table summarizing all symbols used to explain Group Mapping and Object Clustering


The second and third are security misconfiguration 
candidates, while the first and fourth are accessibility 
misconfiguration candidates. User-set 1 does not gen- 
erate a misconfiguration candidate because the mapping 
is exact. 


Fixing these misconfigurations will improve the map- 
ping from user-sets to reference groups in future runs of 
the algorithm. For example, if the administrator removes 
user D’s access to objects 9, 10, 11 and 12, the next time 
the algorithm runs, summary statement 3 will reduce
to {A, B, C} → {9, 10, 11, 12}. Group Mapping will
exactly map the new user-set to $G_2$, and hence the number
of misconfiguration candidates will decrease. This is
what we mean by our algorithm reaching an internally 
consistent state, as mentioned in Section 3.2. 


Note that in flagging these candidates, we may have 
missed some misconfigurations. For example, it is certainly
possible that users C and D (forming group $G_3$)
should not have access to objects 15 and 16. But given 
that there is no definition of correct policy, a complete 
and correct list of misconfigurations cannot be expected. 
However, Baaz does ensure that the permissions are con- 
sistent across user-sets and the reference groups they 
map to. 


Baaz can use role mining algorithms in the Matrix Re- 
duction step to find possibly a larger number of sum- 
mary statements. However, our definitions of miscon- 
figuration and our algorithms hinge on the property of 
object-sets being disjoint, without which the system may 
find conflicting misconfiguration candidates. For example,
if summary statement 3 included object 15, i.e.,
{A, B, C, D} → {9, 10, 11, 12, 15}, then object 15 would
be common to the object-sets of summary statements 1 
and 3. Then, from summary statement 3, Group Mapping 
would suggest that D should not have access to object 
15, but the exact Group Map for summary statement 1 
indicates that D should have access to object 15. Hence, 
while Baaz could use role mining algorithms, and lever- 
age richer and larger numbers of user-sets, it would need 
to include more logic to resolve such conflicts. Instead, 
we go with the approach of using the simple Matrix Re- 
duction algorithm that provides object-disjoint user-sets. 
In spite of its procedural limitations, administrators
and resource owners in various domains have found
Baaz’s techniques very useful in finding genuine high-value
misconfigurations. We show this through our evaluation
in Section 8.


Say the Matrix Reduction step from Section 4 outputs
a total of $l$ summary statements and $g$ reference
groups. The input to the Group Mapping step is the
set of user-sets $\mathcal{U} = \{U_1, U_2, \ldots, U_l\}$ from the
summary statements, and the set of reference groups
$\mathcal{G} = \{G_1, G_2, \ldots, G_g\}$. Our objective can now be expressed
in terms of finding a set cover for each user-set $U_i$ using a
subset of the groups in $\mathcal{G}$. A set cover, in its usual sense,
implies that the union of the covering subsets is exactly
equal to the set to be covered. We, however, are interested in
finding an approximate set cover, where the cover need
not be exhaustive, and reference groups could include
members that are not in the user-set. The idea is to find
a maximal overlap between the subject dataset user-sets
and the reference groups. This approximate set cover $C_i$
may contain a group $G_j$ such that some users in $G_j$ are
absent in $U_i$, as shown in Figure 7 with user-sets 2 and 5.
Also, it is not necessary that $C_i$ covers every user in $U_i$,
as shown with user-sets 3 and 4. We refer to the set of uncovered
users in $U_i$ as $T_i$, i.e., $T_i = U_i - \bigcup_{G_j \in C_i} G_j$.
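
To make the two kinds of mismatch concrete, the sketch below computes the uncovered users $T_i$ and the extraneous users of a group for the running example. The function names are hypothetical; the user-sets and reference groups are those of Figure 4 and Section 4.2, and the covers are the ones reported for Figure 7.

```python
def uncovered_users(user_set, cover):
    """T_i = U_i minus the union of all groups in the cover C_i."""
    return set(user_set) - set().union(*cover)


def extraneous_users(user_set, group):
    """Users that belong to the group but not to the user-set."""
    return set(group) - set(user_set)


G1, G2, G3 = set("CDEFGHJ"), set("ABC"), set("CD")
U2, U4 = set("CDEFG"), set("ABCDI")

T4 = uncovered_users(U4, [G2])      # user-set 4 is covered by G2 only
extra2 = extraneous_users(U2, G1)   # user-set 2 is covered by G1
print(sorted(T4), sorted(extra2))
```

For user-set 4 this leaves users D and I uncovered, and for user-set 2 it identifies H and J as members of the covering group who lack access, matching the candidates listed for the example.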


We choose an approximate set cover based on the minimum
description length (MDL) principle [11], which ensures
that the overlap is large, while the leftover set of
uncovered users is small. In other words, $|C_i| + |T_i|$ is
minimum over all possible approximate set covers. The
minimum set cover problem is known to be NP-hard, so
exact algorithms can take time exponential in the size
of the input. By the same logic, the problem of finding an
approximate set cover with minimum description length is also
NP-hard. In practice, we have found that if the number
of reference groups is less than 20, then it is feasible
to solve it exactly on our testbed computers. For
larger reference datasets, we use a well-known greedy
approximation algorithm [16], which picks the set that
has the maximal overlap, removes it from the reference
set, and repeats the process. This is known to work
within $O(\log m)$ of optimal, where $m$ is the number of
GROUP MAPPING
Input: S {summary statements}, 𝒢 {reference groups}
Output: GAM {accessibility misconfigs [users, objects]},
        GSM {security misconfigs [users, objects]}

 1: GAM = ∅; GSM = ∅
 2: 𝒰 = all user-sets in the extracted summary statements S
 3: for all U_i ∈ 𝒰 do
 4:     (C_i, T_i) = Map Groups(U_i, 𝒢)
 5:     for all G_j ∈ C_i do
 6:         if |G_j - U_i| / |G_j| < 0.5 then
 7:             GAM = GAM ∪ {[G_j - U_i, O_i]}
 8:         end if
 9:     end for
10:     if |T_i| / |U_i| < 0.5 then
11:         GSM = GSM ∪ {[T_i, O_i]}
12:     end if
13: end for
14: return GAM, GSM


MAP GROUPS (APPROXIMATE)
Input: U_i {user-set to be covered}, 𝒢 {groups}
Output: C_i {cover from 𝒢}, T_i {uncovered users in U_i}

 1: C_i = ∅; T_i = ∅; 𝒢' = ∅; U'_i = U_i
 2: for all G ∈ 𝒢 do
 3:     if |G - U_i| / |G| < 0.5 then
 4:         𝒢' = 𝒢' ∪ {G}
 5:     end if
 6: end for
 7: repeat
 8:     G_min = ∅; MDL_min = MDL(U_i, C_i)
 9:     for all G ∈ 𝒢' do
10:         if MDL(U_i, C_i ∪ {G}) < MDL_min then
11:             G_min = G
12:             MDL_min = MDL(U_i, C_i ∪ {G_min})
13:         end if
14:     end for
15:     if G_min = ∅ then
16:         return C_i, U'_i
17:     end if
18:     C_i = C_i ∪ {G_min}; U'_i = U'_i - G_min
19: until U'_i = ∅
20: return C_i, ∅


Figure 8: Group Mapping Algorithm. 


users in the user set, for the original minimum set cover 
problem. We modify this algorithm suitably to gener- 
ate the approximate set cover with minimum description 
length. 

Figure 8 shows the pseudocode for our Group Mapping
algorithm. The main steps of the algorithm for a
given list of user-sets $\{U_1, U_2, \ldots, U_l\}$ can be summarized
as follows:


Step 1: For each user-set, first eliminate all groups in
which more than half of the users are not members
of the user-set (lines 2-6 in MAP GROUPS, Figure 8).
Since less than half of the users in such a
reference group intersect with the user-set, the reference
group will not figure in either security or accessibility
misconfiguration candidates as defined in
Section 5.2.


Step 2: When the number of groups in G is less than 
20, we exhaustively search for all set covers and 
use the minimum. For larger G, we use a modi- 
fied version of the greedy set-cover algorithm to do 
the matching, as shown in Figure 8. For each user- 
set U;, we pick a group G that overlaps maximally 
with U; (pick any one in case of ties). To apply 
the minimum description length principle, we de- 
fine the description length for U; in terms of G as 
|U; — G| + |G — U;,|. For example, in user-set 2, 
two potential mappings are G1 as shown in the ex- 
ample, or Gg, which contains users C' and D. In 
the former case, — G,| is 0, and |Gy — U4] is 
2, since Gy contains two extraneous users, H and 
J. In the latter mapping, |U2z — Gs| is 3, since Gg 
covers C' and D and does not include /, F’ and G. 
Also, Us| is 0. Therefore the MDL metric for 
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the former cover is 2, while in the latter case it is 3. 
Hence our algorithm picks G, as the cover. Refer 
to lines 8-14 in MAP GROUPS, Figure 8. 


Add this selected group to the cover $C_i$. Remove
the covered users from $U_i$ to get $U_i'$ and repeat
until all users are covered; the users that cannot be
covered by any group are output as $T_i$. Refer to
lines 15-19 in MAP GROUPS, Figure 8.
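
The two steps above can be sketched in Python as follows. This is a minimal illustration rather than the Baaz implementation: the function name `map_groups`, the dictionary-based group representation, and the group name `G_small` are assumptions made for the example. The sketch covers only the greedy branch (not the exhaustive search used for fewer than 20 groups), and it measures uncovered users against the users still remaining while measuring extraneous users against the original user-set, which is one reasonable reading of the description.

```python
def map_groups(user_set, groups, threshold=0.5):
    """Greedy, MDL-guided cover of one user-set by reference groups.

    user_set: set of user names (one U_i).
    groups:   dict mapping a (hypothetical) group name to its member set.
    Returns (cover, uncovered): chosen group names in pick order, and the
    users that no surviving group could cover (T_i in the text).
    """
    # Step 1: keep only groups in which outsiders are a strict minority.
    candidates = {
        name: members
        for name, members in groups.items()
        if members and len(members - user_set) / len(members) < threshold
    }

    remaining = set(user_set)
    cover = []
    while remaining:
        # Only groups that still cover at least one remaining user matter.
        useful = {n: g for n, g in candidates.items() if g & remaining}
        if not useful:
            break
        # Description length: users left uncovered plus extraneous members.
        best = min(
            sorted(useful),
            key=lambda n: len(remaining - useful[n]) + len(useful[n] - user_set),
        )
        cover.append(best)
        remaining -= candidates.pop(best)
    return cover, remaining


# User-set 2 from the running example, with G1 and a smaller group {C, D}.
user_set_2 = set("CDEFG")
reference_groups = {"G1": set("CDEFGHI"), "G_small": set("CD")}
cover, uncovered = map_groups(user_set_2, reference_groups)
```

On the running example this picks `G1` (description length 2) over the smaller group (description length 3) and leaves no user uncovered, matching the worked numbers in Step 2.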


Using this mapping, we can find both security and
accessibility misconfigurations for each user set $U_i$
extracted from the summary statements ($U_i \rightarrow O_i$),
as shown in lines 4-14 of GROUP MAPPING, Figure 8. The
summary statement can be rewritten as:


$$\Big\{ \bigcup_{G_j \in C_i} G_j' \;\cup\; T_i \Big\} \rightarrow O_i$$


where $G_j' = G_j \cap U_i$ for all $G_j \in C_i$. Let
$\Delta G_j$ be the users in $G_j$ who are not in $U_i$.
Note that Step 1 ensures that $|\Delta G_j| / |G_j| < 0.5$,
that is, $\Delta G_j$ is a minority in $G_j$.
Based on the intuition provided in the previous section,
we infer that users in $\Delta G_j$ (if any) may require
access to the objects $O_i$. Hence, the intended access
should be $\{ \bigcup_{G_j \in C_i} G_j \cup T_i \} \rightarrow O_i$,
and for each $G_j \in C_i$ whose corresponding
$\Delta G_j \neq \emptyset$, the system reports an
accessibility misconfiguration candidate as:


users in $\Delta G_j$ MAY need access to $O_i$


Finding security misconfiguration candidates is a
slightly different process. Again, for a given user-set $U_i$,
the users in $T_i$ are those that do not match any of the
reference groups but still have access to $O_i$. If these users
form a minority of the users in the user-set $U_i$, that is
$|T_i| / |U_i| < 0.5$ and $T_i \neq \emptyset$, then the system
infers that the intended access should be
$\{ \bigcup_{G_j \in C_i} G_j \} \rightarrow O_i$ and all
users in $T_i$ are reported to be security misconfiguration
candidates as:


users in $T_i$ MAY NOT need access to $O_i$


Note that while we use metrics based on simple
majority and minority to detect misconfiguration
candidates, our prototype implementation supports any
threshold value between 0 and 1. A higher threshold may find
more valid misconfigurations but may also increase the
number of false alarms.

Complexity: The group mapping run time is bounded as
$O(k^2 l g)$, where $k$ is the maximum number of users in a
reference group, $g$ is the number of reference groups and
$l$ is the number of summary statements.
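
As an illustration of how candidates follow from a cover, the sketch below derives both kinds for one summary statement. It is a hedged example, not the authors' code: the function name, the dictionary layout, and the second (made-up) user-set with user X are assumptions. The `threshold` parameter plays the role of the configurable 0.5 cut-off described above.

```python
def group_mapping_candidates(user_set, cover_groups, uncovered, threshold=0.5):
    """Derive misconfiguration candidates for one summary statement.

    user_set:     users in U_i.
    cover_groups: dict of group name -> members, for the groups in C_i.
    uncovered:    users in T_i (in U_i but in no covering group).
    Returns (accessibility, security): a dict of group name -> users who
    MAY need access, and a set of users who MAY NOT need access.
    """
    accessibility = {}
    for name, members in cover_groups.items():
        delta = members - user_set  # group members who lack access
        if delta:
            accessibility[name] = delta

    security = set()
    # Uncovered users are flagged only when they are a strict minority.
    if uncovered and len(uncovered) / len(user_set) < threshold:
        security = set(uncovered)
    return accessibility, security


# User-set 2 covered by G1: H and I are accessibility candidates.
acc, sec = group_mapping_candidates(
    user_set=set("CDEFG"),
    cover_groups={"G1": set("CDEFGHI")},
    uncovered=set(),
)

# Hypothetical user-set where X matches no reference group.
acc2, sec2 = group_mapping_candidates(
    user_set={"A", "B", "C", "D", "X"},
    cover_groups={"G": {"A", "B", "C", "D"}},
    uncovered={"X"},
)
```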


5.3 Misconfiguration Prioritization


When Baaz presents the misconfiguration report to the 
administrator, it lists the candidates in a priority order. 
Prioritization of candidate misconfigurations is impor- 
tant because administrators may not have the time to vali- 
date all misconfiguration candidates that Baaz outputs, as 
in Dataset 2 in Section 8. In such cases, a ranking func- 
tion helps them focus their attention on the high-value 
candidates. 

The main intuition behind our ranking function is that
the smaller the mismatch between a user-set and its covering
reference group, the higher the likelihood that the
misconfiguration candidate is a valid issue. The
formulas used to prioritize both accessibility and
security candidates capture this measure of difference
between a user-set and its cover.

For accessibility misconfigurations, for a given $U_i$, the
system computes a priority over each reference group $G_j$
in $C_i$, as:


$$P(\text{accessibility misconfig}) = 1 - \frac{|\Delta G_j|}{|G_j|}$$

For security misconfiguration candidates, we use the
fraction of potentially unauthorized users to prioritize as
follows. The smaller the fraction of uncovered users, the
higher the priority.


$$P(\text{security misconfig}) = 1 - \frac{|T_i|}{|U_i|}$$
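
A minimal Python sketch of these two scores follows. It assumes the priorities take the form "one minus the fraction of extraneous group members" and "one minus the fraction of uncovered users", which is how the surrounding description reads; the function names and the example user X are illustrative only.

```python
def accessibility_priority(group, user_set):
    """1 - |delta G_j| / |G_j|: fewer outsiders in the group, higher priority."""
    delta = group - user_set
    return 1 - len(delta) / len(group)


def security_priority(uncovered, user_set):
    """1 - |T_i| / |U_i|: fewer uncovered users, higher priority."""
    return 1 - len(uncovered) / len(user_set)


# G1 = {C..I} covering user-set 2 = {C..G}: two of seven members are outsiders.
p_acc = accessibility_priority(set("CDEFGHI"), set("CDEFG"))

# Hypothetical user-set of five users with one uncovered user X.
p_sec = security_priority({"X"}, {"A", "B", "C", "D", "X"})
```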


6 Object Clustering 


Our second technique for finding misconfiguration can-
didates is Object Clustering. This procedure uses only
the summary statements as input and is therefore partic-
ularly useful in the absence of suitable reference groups.

Summ St 5: C, D, E, F, G, H => 1, 2, 3, 4, 5      Summ St 3: A, B, C, D => 9, 10, 11, 12

Summ St 2: C, D, E, F, G => 6, 7                  Summ St 4: A, B, C, D, I => 13

H => 6, 7                                         I =/=> 13


Figure 9: The result of the Object Clustering algorithm
on the example subject matrix.


6.1 Intuition 


We first present the intuition behind our Object Clustering
algorithm. When the access permissions for a small
user-set are only slightly different from the access
permissions for a much larger user-set, this may indicate a
misconfiguration.

Figure 9 explains this intuition using our example.
Observe that the user-sets for summary statements 3 and 4
differ in one user, I, because I has access to object
13 but does not have access to any of 9, 10, 11 and 12.
On the other hand, users A, B, C and D have access to
objects 9, 10, 11, 12 and 13. Therefore, Baaz suggests a
security misconfiguration candidate:


user I MAY NOT need access to object 13. 


Similarly, summary statements 5 and 2 differ in only
one user, H, because H does not have access to objects
6 and 7. Users C, D, E, F and G have access to 1, 2,
3, 4, 5, 6 and 7. Therefore, as shown in the figure, Baaz
suggests an accessibility misconfiguration candidate:


user H MAY need access to objects 6 and 7. 


The matrix in Figure 9 shows that if an administrator
or resource owner determines that these are indeed
valid misconfigurations and fixes them, the matrix
becomes more uniform. A future iteration of matrix
reduction will output fewer summary statements. In this
example, C, D, E, F, G and H now have identical access
and hence the reduction will remove summary statement
2. Similarly, since user I will no longer have access to
object 13, statement 4 will not be found in future
iterations. This will lead to our algorithms finding the same
number, or fewer, misconfiguration candidates in the
future, if no changes are made to the input matrices. This
supports our claim of internal consistency in Section 3.2.

OBJECT CLUSTERING
Input: S {summary statements}
Output: OAM {accessibility misconfigurations [users, objects]},
        OSM {security misconfigurations [users, objects]}
OAM = ∅; OSM = ∅
for all pairs of summary statements in S, [U1, O1] & [U2, O2] do
    if |U1 − U2|/|U1| < 0.5 and |U2 − U1|/|U1| < 0.5 and |O2|/|O1| < 0.5 then
        if U1 − U2 ≠ ∅ then
            OAM = OAM ∪ {[U1 − U2, O2]}
        end if
        if U2 − U1 ≠ ∅ then
            OSM = OSM ∪ {[U2 − U1, O2]}
        end if
    end if
end for
return OAM, OSM


Figure 10: Object Clustering algorithm. 


The Group Mapping and Object Clustering phases do 
not find disjoint sets of misconfigurations. For exam- 
ple, both the above misconfigurations were also flagged 
by Group Mapping. We intend to use Object Clustering 
as a fallback in situations where there do not exist suit- 
able reference groups to flag misconfiguration candidates 
through Group Mapping. 


6.2 Algorithm 

We now describe the Object Clustering algorithm in
detail. We first look for pairs of summary statements with
the following template:


$U_1 \rightarrow O_1$ and $U_2 \rightarrow O_2$ such that
$\frac{|U_1 - U_2|}{|U_1|} < 0.5$,
$\frac{|U_2 - U_1|}{|U_1|} < 0.5$, and
$\frac{|O_2|}{|O_1|} < 0.5$


Now, our definition of an object misconfiguration is as
follows: For the two summary statements, $U_1 \rightarrow O_1$ and
$U_2 \rightarrow O_2$ that match the template, $|U_1 - U_2| / |U_1|$
and $|U_2 - U_1| / |U_1|$ are both smaller than 0.5 (a majority
of users in $U_1$ are in $U_2$ and vice-versa), and $|O_2| / |O_1|$ is
smaller than 0.5 ($O_2$ is less than half the size of $O_1$). We
characterize a security misconfiguration candidate as:


$U_2 - U_1$ MAY NOT need access to $O_2$.


and an accessibility misconfiguration candidate is
given as:


$U_1 - U_2$ MAY need access to $O_2$.

Complexity: Given that there are $l$ summary statements,
$n$ users, and $m$ objects, the Object Clustering algorithm
runs in $O(l^2(n + m))$ time.
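
The pairwise test can be sketched in Python as below. This is an illustrative reading of the template, not the Baaz source: the function name and the `(users, objects)` tuple layout are assumptions, both user-set differences are normalized by the size of the first user-set, and every ordered pair of statements is examined. The input reproduces the four summary statements of Figure 9.

```python
from itertools import permutations


def object_clustering(statements, threshold=0.5):
    """Flag candidates from pairs of summary statements.

    statements: list of (users, objects) pairs, each a non-empty frozenset.
    Returns (accessibility, security) lists of (users, objects) candidates.
    """
    accessibility, security = [], []
    # Ordered pairs: the first statement plays U1 -> O1, the second U2 -> O2.
    for (u1, o1), (u2, o2) in permutations(statements, 2):
        similar_users = (
            len(u1 - u2) / len(u1) < threshold
            and len(u2 - u1) / len(u1) < threshold
        )
        much_smaller_objects = len(o2) / len(o1) < threshold
        if similar_users and much_smaller_objects:
            if u1 - u2:
                accessibility.append((u1 - u2, o2))  # MAY need access
            if u2 - u1:
                security.append((u2 - u1, o2))  # MAY NOT need access
    return accessibility, security


statements = [
    (frozenset("CDEFGH"), frozenset({1, 2, 3, 4, 5})),  # summary statement 5
    (frozenset("CDEFG"), frozenset({6, 7})),            # summary statement 2
    (frozenset("ABCD"), frozenset({9, 10, 11, 12})),    # summary statement 3
    (frozenset("ABCDI"), frozenset({13})),              # summary statement 4
]
oam, osm = object_clustering(statements)
```

On this input the sketch reports that H may need access to objects 6 and 7 and that I may not need access to object 13, the two candidates discussed in Section 6.1.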


6.3 Misconfiguration Prioritization 


In the report, as in the case of Group Mapping, the Baaz
server prioritizes these misconfigurations using the
intuition that the more similar the user-sets $U_1$ and $U_2$, and
the smaller the size of $O_2$, the higher the probability that
the candidate is a genuine misconfiguration. The metric
we use is the arithmetic mean of these two measures:


$$P(\text{misconfig}) = 0.5 \times \left( \left(1 - \frac{|\Delta U|}{|U_1|}\right) + \left(1 - \frac{|O_2|}{|O_1|}\right) \right)$$


Here $\Delta U$ corresponds to $U_2 - U_1$ or $U_1 - U_2$
depending on whether it is a security or an accessibility
misconfiguration.
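
As a small worked example, the following Python sketch evaluates this score for the two candidates of Figure 9. It assumes the two fractions in the formula are the size of the differing users relative to the first user-set and the size of the second object-set relative to the first, as the preceding sentence suggests; the function name is illustrative.

```python
def object_clustering_priority(delta_users, u1, o1, o2):
    """Mean of user-set similarity and relative smallness of O2."""
    user_similarity = 1 - len(delta_users) / len(u1)
    object_smallness = 1 - len(o2) / len(o1)
    return 0.5 * (user_similarity + object_smallness)


# Accessibility candidate: H vs. statements 5 (U1, O1) and 2 (U2, O2).
p_access = object_clustering_priority(
    {"H"}, set("CDEFGH"), {1, 2, 3, 4, 5}, {6, 7}
)

# Security candidate: I vs. statements 3 (U1, O1) and 4 (U2, O2).
p_security = object_clustering_priority(
    {"I"}, set("ABCD"), {9, 10, 11, 12}, {13}
)
```

Under these assumptions the security candidate scores 0.75 and the accessibility candidate about 0.72, so the report would list the security candidate first.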


7 System Experiences 

In this section, we describe issues that impact the quality 
of the misconfiguration reports produced by Baaz, based 
on our experiences in implementing and evaluating the 
Baaz server and stubs for our prototype, and discuss how 
we address them in our system design. 


7.1 Server Design Issues 

Here, we discuss our choice of reference dataset in our 
deployment and how an administrator can tune report 
time. 

Choosing reference datasets: An administrator 
needs to use domain knowledge to choose the right ref- 
erence dataset for a given subject dataset. We observe 
that the output reports vary depending on how rich or 
rigid the reference groups are. Some reference datasets, 
such as organizational group-membership relations, are 
rigid and structured, and contain few reference groups, 
potentially generating many misconfiguration candidates 
in the Group Mapping step, several of which may be in- 
valid. This is because fewer groups will yield more ap- 
proximate covers. 

On the other hand, if a reference dataset contains a 
large number of reference groups, such as a set of email 
distribution lists, the report will contain fewer candidates 
because the chances of finding exact covers increase. As
a result, the algorithm may not detect some valid mis- 
configurations. An administrator can decide which refer- 
ence dataset to use, based on the sensitivity of the subject 
dataset, trading manual effort of validation for caution. 
For example, if a subject dataset folder is marked confi- 
dential, the administrator may choose to compare it with 
the organizational hierarchy, whereas email lists may be 
a better choice for less sensitive information. 

In our evaluation described in Section 8, we choose 
email distribution lists as a reference dataset for two 
datasets and organizational hierarchy as a reference for 
one dataset, and our results verify our observations 
above. 

Tuning report time: Since change events trigger
Baaz’s misconfiguration detection algorithms, the server
may generate reports even in transient states while
administrators manually change permissions. To avoid
such spurious reports, each pair of subject and
reference datasets has an associated report time ($T_r$): Baaz
includes a candidate in its report only if it has existed for
at least $T_r$. The administrator can configure $T_r$ to be
short for subjects that store highly sensitive data, and
longer for less important subjects. In our deployed
prototype, we found that we could generate a report as
fast as one second after a stub reports a change, or delay
its reporting using $T_r$, as required.
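
One way to realize such a report time is to remember when each candidate was first seen and to report it only once it has persisted long enough. The Python sketch below illustrates this idea under stated assumptions: the class name `ReportFilter`, the use of plain numeric timestamps in seconds, and the string candidate identifiers are all hypothetical and not taken from the Baaz implementation.

```python
class ReportFilter:
    """Suppress candidates that have not persisted for the report time."""

    def __init__(self, report_time):
        self.report_time = report_time  # seconds a candidate must persist
        self.first_seen = {}            # candidate -> time first observed

    def update(self, candidates, now):
        """Record the current candidates and return those old enough to report."""
        current = set(candidates)
        # Forget candidates that disappeared (for example, transient states).
        self.first_seen = {
            c: t for c, t in self.first_seen.items() if c in current
        }
        for c in current:
            self.first_seen.setdefault(c, now)
        return sorted(
            c for c, t in self.first_seen.items()
            if now - t >= self.report_time
        )


f = ReportFilter(report_time=60)
r0 = f.update({"a"}, now=0)         # just appeared, not reported yet
r1 = f.update({"a", "b"}, now=30)   # still too young
r2 = f.update({"a", "b"}, now=60)   # "a" has existed for 60 seconds
r3 = f.update({"b"}, now=90)        # "a" was fixed, "b" is now old enough
r4 = f.update({"c"}, now=100)       # "b" vanished, "c" is new
```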


7.2 Stub Design Issues 


We identify two design issues that directly play a role in 
the quality of generated reports: 

Modeling access control: The system’s misconfigu- 
ration detection can only be as good as the data the stub 
provides. Access control mechanisms can be compli- 
cated [20], which sometimes makes capturing complete 
semantics in a stub quite hard. In our stub implemen- 
tations, we have used a conservative approach towards 
modeling access control: if there is ambiguity of whether 
an individual or group has access to an object, we assume 
that they do indeed have access. This approach catches 
more security candidates albeit at the risk of increasing 
the number of false alarms. Previously proposed security 
monitoring systems have tackled this problem [6] using 
a similar strategy. 

Stub customization: Access mechanisms of different 
kinds of resources will require custom stub implemen- 
tations that can specifically understand the underlying 
access controls. Similarly, stubs may need to be cus- 
tomized to different data layouts containing group mem- 
bership data. However, some stubs can be reused across 
resources. For example, in our prototype, we have imple- 
mented a stub that can run on any SMB-based Windows 
file share. We have also implemented customized stubs 
to capture organizational hierarchy and email lists within 
our enterprise, both of which reside on an Active Directory
server [1] (an implementation of the Lightweight
Directory Access Protocol, LDAP).

Access control permissions are not necessarily binary. 
For example, in a file share, “read-only” access or “full 
access” are only two of a number of different access 
types possible. Consequently, our stub implementations 
support various modes of operation. An administrator 
can choose what a “1” in the binary relation matrix cap- 
tures: full access, read-only access, any kind of access, 
etc. 
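
To make the choice of what a "1" captures concrete, the Python sketch below builds a binary relation matrix from a permission table under a selectable mode. Everything here is an assumption made for illustration: the mode names (`any`, `read`, `full`), the permission strings, the treatment of full access as implying read access, and the `ambiguous` flag, which mirrors the conservative rule from "Modeling access control" (when in doubt, assume access).

```python
def relation_bit(permissions, mode="any", ambiguous=False):
    """Return 1 if the permissions count as access under the chosen mode."""
    if ambiguous:
        return 1  # conservative: assume access when semantics are unclear
    if mode == "any":
        return 1 if permissions else 0
    if mode == "read":
        # Assumption: full access implies read access.
        return 1 if permissions & {"read", "full"} else 0
    if mode == "full":
        return 1 if "full" in permissions else 0
    raise ValueError(f"unknown mode: {mode}")


def relation_matrix(acl, users, objects, mode="any"):
    """Rows are users, columns are objects; acl maps (user, object) -> set."""
    return [
        [relation_bit(acl.get((u, o), set()), mode) for o in objects]
        for u in users
    ]


acl = {("alice", "docs"): {"read"}, ("bob", "docs"): {"full"}}
users = ["alice", "bob", "carol"]
objects = ["docs"]
any_matrix = relation_matrix(acl, users, objects, mode="any")
full_matrix = relation_matrix(acl, users, objects, mode="full")
```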


8 Evaluation


In this section, we first describe the implementation of 
Baaz system components (Section 8.1). Next, we de- 
scribe the results we achieve through our prototype de- 
ployment (Section 8.2), followed by a description of the 
collection, analysis, and validation of misconfiguration 
reports from two other datasets (Section 8.3). Finally, 
we present performance microbenchmarks that
demonstrate the scalability of the misconfiguration
detection algorithms (Section 8.4).


8.1 Implementation 


We have implemented the Baaz server in C# using 2707 
lines of code. We have also implemented Baaz stubs for 
an SMB-based Windows file server, for organizational 
groups in Active Directory [1], and for email distribu- 
tion lists also stored in Active Directory. The Windows 
file server stub is entirely event-based: it traps changes in 
access control through the FileSystemWatcher [8] library
and reports these changes immediately to the server. Cur- 
rently, we only trap changes to access control for direc- 
tories, but we can easily extend this to capture changes 
for individual files. The Active Directory stubs, on the 
other hand, poll the database every 8 minutes since we 
do not have the right permissions or mechanisms to build 
an event-based stub for Active Directory. The file server 
stub used 830 lines of C# code and the Active Directory 
stub, which used a common code base for both the orga- 
nizational groups and email lists, was 1327 lines of C# 
code. 


8.2 Evaluation Through Deployment 


We have deployed Baaz within our organization, with 
stubs continuously monitoring two resources within our 
organization since August 19th, 2009. The stubs mon- 
itor read access permissions for directories on a Win- 
dows SMB file server that the employees use to share 
confidential data, and an Active Directory server storing 
email distribution lists relevant to the organization. Var- 
ious groups within the organization actively use the file 
server to share documents, hence we found significant 
usage of access control capabilities on it. 

The objective of our deployment was to see whether 
Baaz could help find valid access control misconfigura- 
tions on this file server. We therefore registered the file 
server as the subject dataset and the email distribution list 
as the reference dataset with the server. We decided to 
use email distribution lists as opposed to organizational 
hierarchy since our administrator observed that only or- 
ganizational groups might not capture the various user 
sets that actively use the file server. 

We show our results in three steps: first, we show 
how Baaz’s first report in the deployment was effective 
in finding misconfigurations. Second, we show the util- 
ity of continuously monitoring changes in access con- 
trol to find misconfigurations. Third, we compare our 
results with the ground-truth we established by manually 
inspecting directory permissions on the file server, to de- 
tect how many actual misconfigurations Baaz was able to 
flag. 

Table 2: Datasets used to evaluate Baaz.


| Group Mapping | Object Clustering | Group Mapping | Object Clustering |
| Tot | Val. | Exc. | Inv. | Tot | Val. | Exc. | Inv. | Tot | Val. | Exc. | Inv. | Tot | Val. | Exc. | Inv. |

Table 3: Misconfiguration analysis for each report generated by Baaz.


First-time report: Row 1 in Table 2 provides details
on this dataset, and row 1 in Table 3 gives the
classification of the first-time report that Baaz generated using the
relation matrices that the stubs sent to the Baaz server
initially. The total number of users in the organization
is 149, the number of objects (directories) in the subject
dataset’s relation matrix is 105682, and the total
number of reference groups (or unique distribution lists) is
237. The matrix reduction phase on the subject dataset
produced 39 summary statements.

Baaz flagged a total of 39 misconfiguration candidates. 
To validate these, we involved the system administrator 
and the respective resource owners of the directories in 
question. 

Security: Of the 11 security candidates that Baaz 
found through Group Mapping, 10 were valid secu- 
rity issues which the administrator considered important 
enough to fix immediately. Object Clustering found 7 of 
these 10 security misconfigurations, showing that Baaz 
would have been helpful in flagging security issues even 
if reference groups were not available to it. However 
it is clear that Group Mapping works more effectively 
than Object Clustering when a suitable reference dataset 
is available. 

Accessibility: Baaz found 8 accessibility candidates 
through Group Mapping, all of which were valid. All 9 
accessibility issues that Object Clustering flagged were 
invalid, showing that, with this dataset, while Group 
Mapping worked well in bringing out both security and 
accessibility issues, Object Clustering did well only with 
the security misconfigurations. Object Clustering was 
not effective in flagging valid accessibility issues since 
the differences between the summary statements were
unexpectedly large.

Baaz found a total of 18 valid misconfigurations. 
There were 10 security misconfigurations involving 7 
users which, when corrected, fixed access permissions on 
1639 out of 105682 directories on the file server. There 
were 8 accessibility misconfigurations that affected 6 
users and 163 directories. 

Our deployment also helped us understand some of the
reasons why misconfigurations occur in access
control lists, which we summarize below.


• In most cases, the misconfigurations arise because
of employees changing their roles or, as in some
accessibility issues, from new employees joining the
organization.


• One of the security misconfigurations was caused
by a policy change within the organization, which
had only been partially implemented. Certain older
employees had a greater degree of access than newer
employees since the administrator had inadvertently
applied the policy change only to employees who
had joined after the change was announced.


• A resource owner misspelt the name of one of the
users they wanted to provide access to, inadvertently
providing access to a completely unrelated
employee.


Real-time report: In our deployment, the stubs and 
the server are running continuously, monitoring access 
control and group membership changes and subsequently 
running the misconfiguration detection algorithm. On 
September 20th, 2009, an employee within the organi- 
zation adopted a new role, which was reflected by his ad- 
dition to certain email distribution lists. The Baaz stub 
reported these changes to the server, following which 
the server reported one new accessibility misconfigu- 
ration candidate within one second. The administrator 
considered this accessibility misconfiguration important 
enough to rectify promptly. This emphasizes the value of 
Baaz’s continuous monitoring approach since it enables 
administrators to detect misconfigurations in a nearly 
real-time fashion, just after they occur. 

Comparison to Ground-Truth: To understand how
close Baaz was to finding all misconfigurations for this
file server, we manually examined access permissions of
all directories on the file server from the root down to
three levels. Beyond the third level, we only examined
directories whose access permissions differed from their
parent directories. We examined a total of 276
directories.

For each directory, we asked the directory owner two
questions: whether any user permissions to the directory
should be revoked (security misconfiguration), and whether
anyone else should be provided access (accessibility
misconfiguration). This procedure took two days to complete
because of the manual effort involved. While we cannot
claim that even this procedure would find all possible
misconfigurations, we felt this exercise formed a good
baseline to compare against Baaz.

We found that Baaz missed 4 security misconfigurations
and 1 accessibility misconfiguration. Two security
issues went undetected because an email list relevant
to these issues was marked as private by the owner,
and hence our Active Directory stub could not read its
members. If we had the permission to run the stub with
administrator privileges, Baaz would have flagged these
issues. The other 3 issues (2 security and 1 accessibility)
were genuinely missed by Baaz since there were no
reference groups that matched the user-set, and the number
of users involved in the misconfiguration (2) was more
than half the size of the user-set (3).

Hence, while Baaz genuinely missed 3 misconfigura- 
tions, it did flag 18 valid misconfigurations which the ad- 
ministrator found very useful. 


8.3 Snapshot Evaluation


We evaluated Baaz on two other subject and reference 
data pairs. We wrote stubs to gather snapshots of ac- 
cess control and group memberships from these datasets 
and generated a one-time report. Rows 2 and 3 of
Table 2 describe the datasets and Table 3 summarizes our
findings. Dataset 2’s subject is a server hosting shared 
internal web pages for projects and groups across an or- 
ganization. The stub for this subject reads access per- 
missions stored in an XML file in a custom format. The 
reference was, again, a set of email distribution lists cre- 
ated for this organization. This subject dataset comprised 
1794 users and 1917 objects. For this dataset alone, the 
administrator decided to concentrate on misconfiguration 
candidates with priority more than 0.8. 

In Dataset 3, the subject dataset is the set of email lists 
used as reference in Dataset 1, and the reference is the 
set of organizational groups. Here, each organizational 
group consists of a manager and all employees who re- 
port directly to the manager. As we have mentioned ear- 
lier, a reference dataset in Baaz may itself be inaccurate. 
Hence, this evaluation helps us check how stale the mem- 
berships to these email lists are. The number of users in 
this dataset is 115 and the number of objects is 243. The
slight discrepancy in the number of users in Datasets 1
and 3 is due to organizational churn in the period
between the two experiments.

Baaz found many valid misconfigurations in all these
datasets. Across all datasets, most security
misconfigurations resulted from role changes. Other security
misconfigurations arose because an individual user, who had
full permissions to an object, had inadvertently given
access to another user who should not have had access.
The causes of accessibility misconfigurations, similarly,
were moves across organizations or inadvertent mistakes
on the part of the individual manually assigning
permissions.

We now summarize some other insights we acquired 
through this evaluation. 

Administrator input: Baaz can only make
recommendations. Only an administrator, or someone who has
semantic knowledge about access requirements, can
make the final decision of whether a misconfiguration is
valid, an exception or invalid. For distributed access con- 
trol systems such as Windows file servers, the validation 
will have to be through querying multiple people in the 
organization since objects involved in the misconfigura- 
tion can have different owners. This is not a simple task. 

Despite this difficulty, overall, the administrators and 
resource owners found the system very useful since it 
found several valid security and accessibility misconfig- 
urations. Moreover, what the administrators appreciated 
was that, instead of tracking down correct access for po- 
tentially thousands of objects, they needed to concentrate 
on a much smaller set of misconfiguration candidates 
that Baaz reports. For Datasets 1 and 3, the validation 
was mostly through conversation and email, and took ap- 
proximately one hour. For Dataset 2, it took a total of 
three days turnaround time since we communicated only 
through email with resource owners who were at a re- 
mote site to complete the validation. Note that these are 
total turnaround times: it does not mean that an admin- 
istrator spent three complete days just on the validation 
procedure. 

Group Mapping vs Object Clustering: While Group 
Mapping is universally effective at finding misconfigura- 
tions, the Object Clustering approach is effective only in 
datasets which have a lot of statistical similarity. This 
is because Object Clustering relies on finding small de- 
viations from a regular and often-repeated pattern of ac- 
cess control permissions. Datasets 2 and 3 do not have 
a regular pattern since most project web pages and email 
distribution lists had unique access permissions. Conse- 
quently, Object Clustering does not report any miscon- 
figurations for these datasets. On the other hand, it does 
find misconfigurations for the file server (Dataset 1) since 
there were many directories on the file servers we evalu- 
ated with the same access permissions. 

Invalid Misconfigurations: The number of invalid 
misconfigurations varies significantly across the different 
datasets. This is related to our discussion in Section 7.1. 


19th USENIX Security Symposium














[Figure 11 plot: total run time (y-axis, ticks from 0 to 250) against subject matrix size (x-axis, 0 to 3e+06), with one line each for 1296, 324 and 81 reference groups.]


Figure 11: Scalability of the Baaz Algorithm 


The organizational groups form a rigid reference dataset, 
so in Dataset 3, we see a large number of invalid miscon- 
figurations. Across the datasets however, the number of 
invalid misconfigurations were small enough not to dis- 
courage an administrator in adopting our tool. 


8.4 Algorithm Performance 


In this section, we concentrate on the performance and
scalability of the server algorithm. We used Dataset 1
described in Table 2 for this experiment.

We ran the misconfiguration detection algorithm on 
the dataset while varying the subject relation matrix size, 
keeping the number of reference groups constant. To in- 
crease the matrix size, we increased the directory depth 
up to which we included objects into the subject’s re- 
lation matrix, consequently increasing the number of ob- 
jects, and therefore, the number of columns in the matrix. 

Figure 11 shows the results of our experiments. Each 
line represents the algorithm’s total run time which in- 
cludes all three phases — Matrix Reduction, Group Map- 
ping and Object Clustering — with different numbers of 
reference groups. We varied the number of reference 
groups by adding artificially created groups to the ref- 
erence dataset while ensuring that the additional groups 
follow the same size distribution as the real reference 
groups. Every point in the graph is averaged across 20 
runs. We ran all the experiments on a machine with a 3 
GHz Intel Core 2 Duo CPU and 4 GB Memory, running 
a 64-bit version of Windows Server 2008. 

With a matrix size of 2.7 million, and with 1296
reference groups, the misconfiguration detection takes a
total of 246 ms to run. The increase in time is fairly
linear in the matrix size because the Matrix Reduction step
dominates the total run-time of the algorithm. For the
same data point, where Matrix Reduction needs to
inspect roughly 2.7 million cells in the subject’s relation
matrix, Group Mapping needed to process only 24
summary statements and 1296 reference groups, and
Object Clustering processed $\binom{24}{2} = 276$ summary
statement pairs.

Projecting from this graph, for a subject dataset
representing 100,000 employees and 100,000 objects, i.e.,
a matrix size of $10^{10}$, and a reference dataset involving
1296 groups, the misconfiguration detection would take 
approximately 340 seconds to run. Our experiments indi- 
cate that the algorithm can scale to large datasets (much 
larger than encountered in our deployments as shown in 
Table 2), and run fast enough to provide prompt miscon- 
figuration reports. 


9 Related Work 


In this section, we discuss our work in the context of 
related research. 

Recent work by Baker et al. in detecting policy mis- 
configurations [4] uses data mining to infer association- 
rules between groups of resources that can be accessed 
by common sets of users, based on an off-line analysis 
of access attempts in log files. The authors use the pro- 
file and frequency of granted requests to predict and fix 
operational accessibility issues. For example, if a user 
belonging to such a common set inadvertently does not 
have access to a particular resource, their tool will flag 
this as a misconfiguration, and refer this to an appropri- 
ate resource owner. 

Baaz on the other hand operates on access permis- 
sions. Consequently, in most cases, Baaz can flag and 
suggest fixes for misconfigurations before they can be 
exercised operationally. While access log analysis is an 
extremely useful mechanism in detecting security and 
accessibility issues, the approach is inherently comple- 
mentary to the approach of analyzing access control per- 
missions. Ideally, the two should be used in tandem. 

Also, Baaz primarily uses a different technique, Group 
Mapping, whereby the system compares subject and ref- 
erence datasets: several of the misconfigurations that the 
Group Mapping algorithm found in our evaluation could 
not have been found using association rules alone. These 
include the examples presented in Section 8.2 where 
users change roles, or new employees join an organiza- 
tion, and have not accessed any resources yet. In ad- 
dition, Baaz finds both security and accessibility issues 
whereas Baker et al. concentrate only on accessibility 
issues. 

Finally, the goal of their misconfiguration detection is
similar in intent to Baaz’s Object Clustering algorithm. 
While Baaz focuses on identifying sets of users that can 
access disjoint sets of objects, they identify all possible 
sets of users who have common access permissions to 
(possibly overlapping) sets of objects. In Baaz, we 
chose to focus on disjoint object-sets for reasons ex- 
plained earlier. 

Network intrusion prevention and detection systems
also have a similar operational view of
misconfigurations [15, 14]. An attempt is made to characterize
normal behavior, as opposed to anomalous behavior, and
any deviation from this characterization is flagged as a
potential vulnerability. In contrast, research on
automatically discovering attack graphs [2, 23], by correlating
information across lists of known software vulnerabilities,
improper access controls, and network misconfiguration
issues, has a forensic flavor. This aspect is
further explored in more recent work such as HeatRay [6],
which explores identity-snowball attacks based on over- 
entitled user privileges across a networked enterprise. 
The HeatRay tool outputs suggestions to administrators 
to prune privilege-lists on particular machines, maximiz- 
ing security versus availability tradeoffs, using machine 
learning and combinatorial optimization techniques. A 
system such as Baaz can help an administrator decide 
whether to remove access permissions as suggested by 
HeatRay. 


Other related work on policy anomaly detection in- 
cludes the work on access control spaces [13] where the 
authors describe a policy-authoring tool called Gokyo 
that can help discover policy coverage issues. While
Gokyo assumes a high-level policy manifest exists, Baaz 
works in scenarios where such manifests are not avail- 
able. 


Role-based access control (RBAC) [21] is widely cited
as a useful management tool to control access permis- 
sions by separating out the user-role and role-permission 
relationships. However, RBAC is known to be difficult 
to implement in practice [5, 12]. The problem of role 
mining [22, 25, 18, 28, 10] is related to Baaz’s matrix 
reduction step (Section 4), where we find related user 
and object groups. In role-mining, the user-object access 
matrix is analyzed to find maximal overlapping group- 
ings of users and objects that have the same permissions. 
In contrast, in Baaz, we are interested in misconfigura- 
tions on shared-object permissions, as opposed to dis- 
covering common patterns of access across user groups. 
Nevertheless, like organizational groups, email groups, 
and distribution lists, the output of a role-mining algo- 
rithm, specifically the user-role mappings, can be used 
as an input to our group mapping phase. We believe that 
even if organizations adopt some flavor of RBAC, a sys- 
tem like Baaz is useful in discovering misconfigurations 
caused by exceptions and role changes. There is also a 
wealth of related work on the topic of clustering in gen- 
eral, and a summary of this is outside the scope of this 
work. 

Policy anomaly detection is also a popular subject of 
study in the firewall and network configuration space. 
Here, existing tools [27] explore the semantics of differ- 
ent filtering rules and firewall policies. Testing and static 
analysis techniques [26, 17, 3] have been proposed to ex- 
plore and understand how policy configurations satisfy 
properties such as redundancy and contradiction. However,
all of these techniques are specific to firewall configurations
and are inherently different from Baaz which
uses comparison across ACL datasets and within the 
same dataset to find misconfigurations. 

Several network security scanning tools are actively 
used by network administrators to find vulnerabilities 
such as open ports, vulnerable applications and poor 
passwords [7, 9]. Baaz’s purpose and techniques target 
a different problem — finding access control misconfigu- 
rations — and are therefore complementary to the intent 
of these tools. In fact, a number of such tools and sys- 
tems should be used in tandem to ensure a high level of 
security for all enterprise resources. 


10 Conclusion 


In this paper, we have described the design, implementa- 
tion and evaluation of Baaz, a system used to detect ac- 
cess control misconfigurations in shared resources. Baaz 
continuously monitors access permissions and group 
memberships, and through the use of two techniques — 
Group Mapping and Object Clustering — finds candidate 
misconfigurations in the access permissions. Our eval- 
uation shows that Baaz is very effective at finding real 
security and accessibility misconfigurations, which are 
useful to administrators. 
Abstract

Use-after-free vulnerabilities exploiting so-called dan- 
gling pointers to deallocated objects are just as dangerous 
as buffer overflows: they may enable arbitrary code exe- 
cution. Unfortunately, state-of-the-art defenses against 
use-after-free vulnerabilities require compiler support, 
pervasive source code modifications, or incur high per- 
formance overheads. This paper presents and evaluates 
Cling, a memory allocator designed to thwart these at- 
tacks at runtime. Cling utilizes more address space, a 
plentiful resource on modern machines, to prevent type- 
unsafe address space reuse among objects of different 
types. It infers type information about allocated objects 
at runtime by inspecting the call stack of memory allo- 
cation routines. Cling disrupts a large class of attacks 
against use-after-free vulnerabilities, notably including 
those hijacking the C++ virtual function dispatch mecha- 
nism, with low CPU and physical memory overhead even 
for allocation intensive applications. 


1 Introduction 


Dangling pointers are pointers left pointing to deallo- 
cated memory after the object they used to point to has 
been freed. Attackers may use appropriately crafted in- 
puts to manipulate programs containing use-after-free 
vulnerabilities [18] into accessing memory through dan- 
gling pointers. When accessing memory through a dan- 
gling pointer, the compromised program assumes it op- 
erates on an object of the type formerly occupying the 
memory, but will actually operate on whatever data hap- 
pens to be occupying the memory at that time. 

The potential security impact of these so-called temporal
memory safety violations is just as serious as that of
the better known spatial memory safety violations, such 
as buffer overflows. In practice, however, use-after-free 
vulnerabilities were often dismissed as mere denial-of- 
service threats, because successful exploitation for arbi- 
trary code execution requires sophisticated control over 
the layout of heap memory. In one well publicized case,
flaw CVE-2005-4360 [17] in Microsoft IIS remained unpatched
for almost two years after being discovered and
classified as low-risk in December 2005. 

Use-after-free vulnerabilities, however, are receiving 
increasing attention by security researchers and attack- 
ers alike. Researchers have been demonstrating exploita- 
tion techniques, such as heap spraying and heap feng 
shui [21, 1], that achieve the control over heap layout 
necessary for reliable attacks, and several use-after-free 
vulnerabilities have been recently discovered and fixed 
by security researchers and software vendors. By now 
far from a theoretical risk, use-after-free vulnerabilities 
have been used against Microsoft IE in the wild, such 
as CVE-2008-4844, and more recently CVE-2010-0249 
in the well publicized attack on Google’s corporate net- 
work. 

Such attacks exploiting use-after-free vulnerabilities 
may become more widespread. Dangling pointers likely 
abound in programs using manual memory management, 
because consistent manual memory management across 
large programs is notoriously error prone. Some dan- 
gling pointer bugs cause crashes and can be discovered 
during early testing, but others may go unnoticed be- 
cause the dangling pointer is either not created or not 
dereferenced in typical execution scenarios, or it is deref- 
erenced before the pointed-to memory has been reused 
for other objects. Nevertheless, attackers can still trigger 
unsafe dangling pointer dereferences by using appropri- 
ate inputs to cause a particular sequence of allocation and 
deallocation requests. 

Unlike omitted bounds checks that in many cases are 
easy to spot through local code inspection, use-after-free 
bugs are hard to find through code review, because they 
require reasoning about the state of memory accessed 
by a pointer. This state depends on previously executed 
code, potentially in a different network request. For the 
same reasons, use-after-free bugs are also hard to find
through automated code analysis. Moreover, the combination
of manual memory management and object oriented
programming in C++ provides fertile ground for
attacks, because, as we will explain in Section 2.1, the 
virtual function dispatch mechanism is an ideal target for 
dangling pointer attacks. 

While other memory management related security 
problems, including invalid frees, double frees, and heap 
metadata overwrites, have been addressed efficiently and 
transparently to the programmer in state-of-the-art mem- 
ory allocators, existing defenses against use-after-free 
vulnerabilities incur high overheads or require compiler 
support and pervasive source code modifications. 

In this paper we describe and evaluate Cling, a mem- 
ory allocator designed to harden programs against use- 
after-free vulnerabilities transparently and with low over- 
head. Cling constrains memory allocation to allow ad- 
dress space reuse only among objects of the same type. 
Allocation requests are inferred to be for objects of the 
same type by inspecting the allocation routine’s call stack 
under the assumption that an allocation site (i.e. a call site 
of malloc or new) allocates objects of a single type 
or arrays of objects of a single type. Simple wrapper 
functions around memory allocation routines (for exam- 
ple, the typical my_malloc or safe_malloc wrap- 
pers checking the return value of malloc for NULL) 
can be detected at runtime and unwound to recover a 
meaningful allocation site. Constraining memory allo- 
cation this way thwarts most dangling pointer attacks 
—importantly— including those attacking the C++ vir- 
tual function dispatch mechanism, and has low CPU and 
memory overhead even for allocation intensive applica- 
tions. 

These benefits are achieved at the cost of using addi- 
tional address space. Fortunately, sufficient amounts of 
address space are available in modern 64-bit machines, 
and Cling does not leak address space over time, because 
the number of memory allocation sites in a program is 
constant. Moreover, for machines with limited address 
space, a mechanism to recover address space is sketched 
in Section 3.6. Although we did not encounter a case 
where the address space of 32-bit machines was insuf- 
ficient in practice, the margins are clearly narrow, and 
some applications are bound to exceed them. In the rest 
of this paper we assume a 64-bit address space—a rea- 
sonable requirement given the current state of technol- 
ogy. 

The rest of the paper is organized as follows. Section 2 
describes the mechanics of dangling pointer attacks and 
how type-safe memory reuse defeats the majority of at- 
tacks. Section 3 describes the design and implementa- 
tion of Cling, our memory allocator that enforces type- 
safe address space reuse at runtime. Section 4 evaluates 
the performance of Cling on CPU bound benchmarks 
with many allocation requests, as well as the Firefox web 
browser (web browsers have been the main target of
use-after-free attacks so far). Finally, we survey related work
in Section 5 and conclude in Section 6. 


2 Background 


2.1 Dangling Pointer Attacks 


Use-after-free errors are so-called temporal memory
safety violations, accessing memory that is no longer 
valid. They are duals of the better known spatial memory 
safety violations, such as buffer overflows, that access 
memory outside prescribed bounds. Temporal memory 
safety violations are just as dangerous as spatial memory 
safety violations. Both can be used to corrupt memory 
with unintended memory writes, or leak secrets through 
unintended memory reads. 

When a program accesses memory through a dangling 
pointer during an attack, it may access the contents of 
some other object that happens to occupy the memory 
at the time. This new object may even contain data le- 
gitimately controlled by the attacker, e.g. content from 
a malicious web page. The attacker can exploit this to 
hijack critical fields in the old object by forcing the pro- 
gram to read attacker supplied values through the dan- 
gling pointer instead. 


Figure 1: Unsafe memory reuse with dangling pointer.
[Figure: over time, the same memory holds Object 1 of type A
(pointer field), then Object 2 of type B (raw data), then
Object 3 of type A (pointer field).]


For example, if a pointer that used to point to an object
with a function pointer field (e.g. object 1 at time t0
in Figure 1) is dereferenced to access the function pointer
after the object has been freed, the value read for the
function pointer will be whatever value happens to occupy
the object’s memory at the moment (e.g. raw data
from object 2 at time t1 in Figure 1). One way to exploit
this is for the attacker to arrange his data to end up
in the memory previously occupied by the object pointed to
by the dangling pointer and supply an appropriate value
within his data to be read in place of the function pointer.
When the attacker then triggers the program to dereference
the dangling pointer, the attacker’s data is interpreted as a
function pointer, diverting program control flow to the location
dictated by the attacker, e.g. to shellcode (attacker code
smuggled into the process as data).
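
The scenario of Figure 1 can be illustrated without invoking
undefined behavior by simulating a single heap slot with a byte
array. The sketch below is only an illustration: the names
(struct handler, struct message, slot, call_through_stale_slot)
are hypothetical, and memcpy stands in for the reads and writes
that a real program would perform through a dangling pointer.

```c
#include <assert.h>
#include <string.h>

/* Type A: an object whose first field is a function pointer. */
struct handler {
    int (*fp)(void);
    int id;
};

/* Type B: an object consisting of attacker-influenced raw data. */
struct message {
    unsigned char bytes[sizeof(struct handler)];
};

static int legit(void) { return 1; }   /* intended call target */
static int other(void) { return 2; }   /* target chosen by the "attacker" */

/* One simulated heap slot that is reused by objects of different types. */
static unsigned char slot[sizeof(struct handler)];

/* Returns the result of calling the function pointer read from the
 * stale slot after the slot has been reused for raw data. */
int call_through_stale_slot(void) {
    struct handler h = { legit, 7 };
    memcpy(slot, &h, sizeof h);            /* t0: object 1 (type A) occupies the slot */

    /* Object 1 is "freed" here, but a stale reference to the slot remains. */

    struct message m;
    int (*chosen)(void) = other;
    memset(m.bytes, 0, sizeof m.bytes);
    memcpy(m.bytes, &chosen, sizeof chosen);
    memcpy(slot, &m, sizeof m);            /* t1: object 2 (type B) reuses the slot */

    struct handler stale;
    memcpy(&stale, slot, sizeof stale);    /* read through the stale reference */
    return stale.fp();                     /* raw data is interpreted as a function pointer */
}
```

Calling call_through_stale_slot() invokes other rather than
legit, because the raw bytes of the second object overlap the
function pointer field of the first.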

Placing a buffer with attacker supplied data at the exact
location pointed to by a dangling pointer is complicated
by unpredictability in heap memory allocation. However, 
the technique of heap spraying can address this chal- 
lenge with high probability of success by allocating large 
amounts of heap memory in the hope that some of it will 
end up at the right memory location. Alternatively, the 
attacker may let the program dereference a random func- 
tion pointer, and similarly to uninitialized memory ac- 
cess exploits, use heap spraying to fill large amounts of 
memory with shellcode, hoping that the random location 
where control flow will land will be occupied by attacker 
code. 

Attacks are not limited to hijacking function pointer
fields in heap objects. Unfortunately, object oriented
programming with manual memory management invites
use-after-free attacks: C++ objects contain pointers to
virtual tables (vtables) used for resolving virtual
functions. In turn, these vtables contain pointers to virtual
functions of the object’s class. Attackers can hijack the
vtable pointers, diverting virtual function calls made
through dangling pointers to a bogus vtable, and execute
attacker code. Such vtable pointers abound in the
heap memory of C++ programs.

Attackers may have to overcome an obstacle: the
vtable pointer in a freed object is often aligned with
the vtable pointer in the new object occupying the
freed object’s memory. This situation is likely, because
the vtable pointer typically occupies the first word of
an object’s memory, and hence will likely be aligned
with the vtable pointer of a new object allocated in its
place right after the original object was freed. The attack
is disrupted because the attacker lacks sufficient control
over the new object’s vtable pointer value, which is maintained
by the language runtime and always points to a
genuine vtable, even if one belonging to the wrong type,
rather than arbitrary, attacker-controlled data. Attackers
may overcome this problem by exploiting objects using
multiple inheritance that have multiple vtable pointers
located at various offsets, or objects derived from a base
class with no virtual functions that do not have vtable
pointers at offset zero, or by manipulating the heap to
achieve an exploitable alignment through an appropriate
sequence of allocations and deallocations. We will see
that our defense prevents attackers from achieving such
exploitable alignments.

Attacks are not limited to subverting control flow; they 
can also hijack data fields [7]. Hijacked data pointers, for 
instance, can be exploited to overwrite other targets, in- 
cluding function pointers, indirectly: if a program writes 
through a data pointer field of a deallocated object, an 
attacker controlling the memory contents of the deallocated
object can divert the write to an arbitrary memory
location. Other potential attacks include information
leaks through reading the contents of a dangling pointer 
now pointing to sensitive information, and privilege es- 
calation by hijacking data fields holding credentials. 

Under certain memory allocator designs, dangling 
pointer bugs can be exploited without memory having 
to be reused by another object. Memory allocator meta- 
data stored in free memory, such as pointers chaining free 
memory chunks into free lists, can play the role of the 
other object. When the deallocated object is referenced 
through a dangling pointer, the heap metadata occupy- 
ing its memory will be interpreted as its fields. For ex- 
ample, a free list pointer may point to a chunk of free 
memory that contains leftover attacker data, such as a 
bogus vtable. Calling a virtual function through the 
dangling pointer would divert control to an arbitrary lo- 
cation of the attacker’s choice. We must consider such 
attacks when designing a memory allocator to mitigate 
use-after-free vulnerabilities. 

Finally, in all the above scenarios, attackers exploit 
reads through dangling pointers, but writes through a 
dangling pointer could also be exploited, by corrupt- 
ing the object, or allocator metadata, now occupying the 
freed object’s memory. 






Figure 2: No memory reuse (very safe but expensive).


2.2 Naive Defense 


A straightforward defense against use-after-free
vulnerabilities that takes advantage of the abundant
address space of modern 64-bit machines is avoiding any
address space reuse. Excessive memory consumption 
can be avoided by reusing freed memory via the operating
system’s virtual memory mechanisms (e.g. relinquishing
physical memory using madvise with the
MADV_DONTNEED option on Linux, or other OS specific 
mechanisms). This simple solution, illustrated in Fig- 
ure 2, protects against all the attacks discussed in Sec- 
tion 2.1, but has three shortcomings. 

First, address space will eventually be exhausted. By 
then, however, the memory allocator could wrap around 
and reuse the address space without significant risk. 

The second problem is more important. Memory frag- 
mentation limits the amount of physical memory that can 
be reused through virtual memory mechanisms. Operating
systems manage physical memory in units of several
kilobytes in the best case; thus, each small allocation can
hold back several kilobytes of physical memory in adjacent
free objects from being reused. In Section 4, we
show that the memory overhead of this solution is too 
high. 

Finally, this solution suffers from a high rate of sys- 
tem calls to relinquish physical memory, and attempting 
to reduce this rate by increasing the block size of mem- 
ory relinquished with a single system call leads to even 
higher memory consumption. 
Figure 3: Type-safe memory reuse.


2.3 Type-Safe Memory Reuse


Type-safe memory reuse, proposed by Dhurjati et al. [9],
allows some memory reuse while preserving type safety. 
It allows dangling pointers, but constrains them to point 
to objects of the same type and alignment. This way, 
dereferencing a dangling pointer cannot cause a type vi- 
olation, rendering use-after-free bugs hard to exploit in 
practice. As illustrated in Figure 3, with type-safe mem- 
ory reuse, memory formerly occupied by pointer fields 
cannot be reused for raw data, preventing attacks such as
the one in Figure 1.

Moreover, memory formerly occupied by pointer 
fields can only overlap with the corresponding pointer 
fields in objects of the same type. This means, for example,
that a hijacked function pointer can only be diverted
to some other function address used for the same
field in a different object, precluding diverting function 
pointers to attacker injected code, and almost certainly 
thwarting return-to-libc [20] attacks diverting function 
pointers to legitimate but suitable executable code in the 
process. More importantly, objects of the same type 
share vtables and their vtable pointers are at the 
same offsets, thus type-safe memory reuse completely 
prevents hijacking of vtable pointers. This is similar
to the attacker constraint discussed in Section 2.1,
where the old vtable pointer happens to be aligned
with another vtable pointer, except that attackers are
even more constrained now: they cannot exploit differ- 
ences in inheritance relationships or evade the obstacle 
by manipulating the heap. 

These cases cover generic exploitation techniques and 
attacks observed in the wild. The remaining attacks are 
less practical but may be exploitable in some cases, de- 
pending on the application and its use of data. Some 
constraints may still be useful; for example, attacks that 
hijack data pointers are constrained to only access mem- 
ory in the corresponding field of another object of the 
same type. In some cases, this may prevent dangerous 
corruption or data leakage. However, reusing memory of 
an object’s data fields for another instance of the same 
type may still enable attacks, including privilege escala- 
tion attacks, e.g. when data structures holding credentials 
or access control information for different users are overlapped
in time. Another potential exploitation avenue is
inconsistencies in the program’s data structures that may 
lead to other memory errors, e.g. a buffer may become in- 
consistent with its size stored in a different object when 
either is accessed through a dangling pointer. Interest- 
ingly, this inconsistency can be detected if spatial pro- 
tection mechanisms, such as bounds checking, are used 
in tandem. 


3 Cling Memory Allocator 


The Cling memory allocator is a drop-in replacement for 
malloc designed to satisfy three requirements: (i) it 
does not reuse free memory for its metadata, (ii) only al- 
lows address space reuse among objects of the same type 
and alignment, and (iii) achieves these without sacrific- 
ing performance. Cling combines several solutions from 
existing memory allocators to achieve its requirements. 


3.1 Out-of-Band Heap Metadata 


The first requirement protects against use-after-free vul- 
nerabilities with dangling pointers to free, not yet reallo- 
cated, memory. As we saw in Section 2.1, if the memory 
allocator uses freed memory for metadata, such as free
list pointers, these allocator metadata can be interpreted
as object fields, e.g. vtable pointers, when free memory
is referenced through a dangling pointer.

Memory allocator designers have considered using 
out-of-band metadata before, because attackers targeted 
in-band heap metadata in several ways: attacker con- 
trolled data in freed objects can be interpreted as heap- 
metadata through double-free vulnerabilities, and heap- 
based overflows can corrupt allocator metadata adjacent 
to heap-based buffers. If the allocator uses corrupt heap 
metadata during its linked list operations, attackers can 
write an arbitrary value to an arbitrary location. 

Although out-of-band heap metadata can solve these 
problems, some memory allocators mitigate heap meta- 
data corruption without resorting to this solution. For 
example, attacks corrupting heap metadata can be ad- 
dressed by detecting the use of corrupted metadata with 
sanity checks on free list pointers before unlinking a free 
chunk or using heap canaries [19] to detect corruption 
due to heap-based buffer overflows. In some cases, cor- 
ruption can be prevented in the first place, e.g. by detect- 
ing attempts to free objects already in a free list. These 
techniques avoid the memory overhead of out-of-band 
metadata, but are insufficient for preventing use-after- 
free vulnerabilities, where no corruption of heap meta- 
data takes place. 

An approach to address this problem in allocator de- 
signs reusing free memory for heap metadata is to ensure 
that these metadata point to invalid memory if interpreted 
as pointers by the application. Merely randomizing the 
metadata by XORing with a secret value may not be suf- 
ficient in the face of heap spraying. One option is setting 
the top bit of every metadata word to ensure it points to 
protected kernel memory, raising a hardware fault if the 
program dereferences a dangling pointer to heap meta- 
data, while the allocator would flip the top bit before 
using the metadata. However, it is still possible that 
the attacker can tamper with the dangling pointer before 
dereferencing it. This approach may be preferred when 
modifying an existing allocator design, but for Cling, we 
chose to keep metadata out-of-band instead. 
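
The top-bit option mentioned above, which Cling does not adopt,
can be sketched as follows. This is a minimal illustration that
assumes a 64-bit platform on which valid user-space addresses
always have the top bit clear; the names TOP_BIT, meta_encode and
meta_decode are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOP_BIT (UINT64_C(1) << 63)

/* Form stored in free memory: the top bit is set, so the word does not
 * look like a valid user-space address if the application dereferences it
 * (assumption: user-space addresses have the top bit clear). */
uint64_t meta_encode(uint64_t next_free) {
    return next_free | TOP_BIT;
}

/* Form used by the allocator itself: the top bit is cleared again. */
uint64_t meta_decode(uint64_t stored) {
    return stored & ~TOP_BIT;
}
```

As the text notes, this only helps if the program dereferences
the stored word unmodified.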

An allocator can keep its metadata outside deallo- 
cated memory using non-intrusive linked lists (next and 
prev pointers stored outside objects) or bitmaps. Non- 
intrusive linked lists can have significant memory over- 
head for small allocations, thus Cling uses a two-level 
allocation scheme where non-intrusive linked lists chain 
large memory chunks into free lists and small allocations 
are carved out of buckets holding objects of the same 
size class using bitmaps. Bitmap allocation schemes 
have been used successfully in popular memory alloca- 
tors aiming for performance [10], so they should not pose 
an inherent performance limitation. 
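
The bitmap half of this two-level scheme might look roughly like
the sketch below. It is not Cling's implementation: the layout
of struct bucket, the limit of 64 slots per bucket, and the
function names are assumptions made for illustration. The point
it shows is that freeing a slot only clears a bit kept outside
the slot, so nothing is written into freed memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS_PER_BUCKET 64   /* hypothetical: one 64-bit bitmap word per bucket */

struct bucket {
    unsigned char *base;      /* start of the bucket's slots in the heap */
    size_t slot_size;         /* size class served by this bucket */
    uint64_t used;            /* out-of-band bitmap: bit i set means slot i is allocated */
};

/* Returns the lowest free slot, or NULL if the bucket is full. */
void *bucket_alloc(struct bucket *b) {
    for (unsigned i = 0; i < SLOTS_PER_BUCKET; i++) {
        uint64_t bit = UINT64_C(1) << i;
        if (!(b->used & bit)) {
            b->used |= bit;
            return b->base + (size_t)i * b->slot_size;
        }
    }
    return NULL;
}

/* Frees a slot by clearing its bit; the slot's contents are left untouched. */
void bucket_free(struct bucket *b, void *p) {
    size_t off = (size_t)((unsigned char *)p - b->base);
    assert(off % b->slot_size == 0 && off / b->slot_size < SLOTS_PER_BUCKET);
    b->used &= ~(UINT64_C(1) << (off / b->slot_size));
}
```

A freed slot is handed out again only to a later request served
by the same bucket, which is what keeps reuse within one size
class.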


3.2 Type-Safe Address Space Reuse 


The second requirement protects against use-after-free 
vulnerabilities where the memory pointed by the dan- 
gling pointer has been reused by some other object. 
As we saw in Section 2.3, constraining dangling pointers
to objects within pools of the same type and alignment
thwarts a large class of attacks exploiting use-after-free
vulnerabilities, including all those used in real attacks.
A runtime memory allocator, however, must address
two challenges to achieve this. First, it must bridge
the semantic gap between type information available to 
the compiler at compile time and memory allocation re- 
quests received at runtime that only specify the number 
of bytes to allocate. Second, it must address the memory 
overheads caused by constraining memory reuse within 
pools. Dhurjati et al. [9], who proposed type-safe mem- 
ory reuse for security, preclude an efficient implemen- 
tation without using a compile time pointer and region 
analysis. 


To solve the first challenge, we observe that security 
is maintained even if memory reuse is over-constrained,
i.e. several allocation pools may exist for the same type, 
as long as memory reuse across objects of different types 
is prevented. Another key observation is that in C/C++ 
programs, an allocation site typically allocates objects of 
a single type or arrays of objects of a single type, which 
can safely share a pool. Moreover, the allocation site 
is available to the allocation routines by inspecting their 
call stack. While different allocation sites may allocate 
objects of the same type that could also safely share the 
same pool, Cling’s inability to infer this could only af- 
fect performance—not security. Section 4 shows that 
in spite of this pessimization, acceptable performance is 
achieved. 


The immediate caller of a memory allocation routine 
can be efficiently retrieved from the call stack by inspect- 
ing the saved return address. However, multiple tail-call 
optimizations in a single routine, elaborate control flow, 
and simple wrappers around allocation routines may ob- 
scure the true allocation site. The first two issues are suf- 
ficiently rare to not undermine the security of the scheme 
in general. These problems are elaborated in Section 3.6, 
and ways to address simple wrappers are described in 
Section 3.5. 
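
Retrieving the immediate caller can be sketched with the
__builtin_return_address builtin provided by GCC and Clang. The
wrapper site_malloc, the global last_site, and the two example
callers are hypothetical names; a real allocator would use the
address as a pool key rather than store it in a global.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Last allocation site observed (illustration only). */
void *last_site;

/* Allocation routine that records its caller's return address.
 * noinline keeps the routine a real call so that a return address exists. */
__attribute__((noinline))
void *site_malloc(size_t n) {
    /* GCC/Clang builtin: the address the current function will return to. */
    last_site = __builtin_return_address(0);
    return malloc(n);
}

/* Two distinct allocation sites. Work after the call prevents the call
 * from being compiled as a tail call, which would hide the site. */
__attribute__((noinline))
void *alloc_node(void) {
    char *p = site_malloc(32);
    if (p) p[0] = 'n';
    return p;
}

__attribute__((noinline))
void *alloc_name(void) {
    char *p = site_malloc(32);
    if (p) p[0] = 's';
    return p;
}
```

Repeated calls to alloc_node observe the same site, while
alloc_name observes a different one, so the two would be served
from separate pools even though they request the same size.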


A further complication, illustrated in Figure 4, is
caused by array allocations and the lack of knowledge of 
array element sizes. As discussed, all new objects must 
be aligned to previously allocated objects, to ensure their 
fields are aligned one to one. This requirement also ap- 
plies to array elements. Figure 4, however, illustrates 
that this constraint can be violated if part of the mem- 
ory previously used by an array is subsequently reused 
by an allocation placed at an arbitrary offset relative to 
Figure 4: Example of unsafe reuse of array memory, even
with allocation pooling, due to not preserving allocation 
offsets. 


the start of the old allocation. Reusing memory from 
a pool dedicated to objects of the same type is not suf- 
ficient for preventing this problem. Memory reuse must 
also preserve offsets within allocated memory. One solu- 
tion is to always reuse memory chunks at the same offset 
within all subsequent allocations. A more constraining 
but simpler solution, used by Cling, is to allow memory 
reuse only among allocations of the same size-class, thus 
ensuring that previously allocated array elements will be 
properly aligned with array elements subsequently occu- 
pying their memory. 
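
The reuse rule can be expressed as a small predicate. The text
does not specify Cling's size classes at this point, so the
power-of-two classes with a 16-byte minimum used below are an
assumption for illustration, as are the names size_class and
may_reuse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size classes: powers of two, starting at 16 bytes. */
size_t size_class(size_t n) {
    size_t c = 16;
    while (c < n && c <= SIZE_MAX / 2)
        c <<= 1;
    return c;
}

/* Memory may be reused between two allocations only if they come from
 * the same allocation site and fall into the same size class. */
int may_reuse(const void *site_a, size_t size_a,
              const void *site_b, size_t size_b) {
    return site_a == site_b && size_class(size_a) == size_class(size_b);
}
```

Under this rule, arrays of 20 and 30 bytes allocated at one site
may share memory, while a 40-byte array from the same site may
not share memory with either.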

This constraint also addresses the variable sized struct 
idiom, where the final field of a structure, such as the following
one, is used to access additional, variable size
memory allocated at the end of the structure: 


struct {
    void (*fp)();      /* function pointer field */
    int len;           /* number of bytes in buffer */
    char buffer[1];    /* variable size data extends past the end */
};


By only reusing memory among instances of such struc- 
tures that fall into the same size-class, and always align- 
ing such structures at the start of this memory, Cling pre- 
vents the structure’s fields, e.g. the function pointer fp in 
this example, from overlapping after their deallocation 
with buffer contents of some other object of the same 
type. 

The second challenge is to address the memory over- 
head incurred by pooling allocations. Dhurjati et al. [8] 
observe that the worst-case memory use increase for a 
program with N pools would be roughly a factor of 
N - 1: when a program first allocates data of type A,
frees all of it, then allocates data of type B, frees all of 
it, and so on. This situation is even worse for Cling, because
it has one pool per size-class per allocation site,
instead of just one pool per type. 

The key observation to avoid excessive memory over- 
head is that physical memory, unlike address space, can 
be safely reused across pools. Cling borrows ideas from 
previous memory allocators [11] designed to manage 
physical memory in blocks (via mmap) rather than monotonically
growing the heap (via sbrk). These allocators
return individual blocks of memory to the operating sys- 
tem as soon as they are completely free. This technique 
allows Cling to reuse blocks of memory across different 
pools. 

Cling manages memory in blocks of 16K bytes, satis- 
fying large allocations using contiguous ranges of blocks 
directly, while carving smaller allocations out of homo- 
geneous blocks called buckets. Cling uses an OS prim- 
itive (e.g. madvise) to inform the OS it can reuse the 
physical memory of freed blocks. 

Deallocated memory accessed through a dangling 
pointer will either continue to hold the data of the in- 
tended object, or will be zero-filled by the OS, trigger- 
ing a fault if a pointer field stored in it is dereferenced. 
It is also possible to page protect address ranges after 
relinquishing their memory (e.g. using mechanisms like 
mprotect on top of madvise). 
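
The effect of relinquishing a block can be sketched as follows.
The example assumes Linux, where MADV_DONTNEED applied to a
private anonymous mapping discards the pages so that the next
access sees zero-filled memory; other systems may behave
differently. The helper names and the BLOCK_SIZE constant are
hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define BLOCK_SIZE (16 * 1024)

/* Reserves one block of address space backed by anonymous memory. */
void *block_reserve(void) {
    void *p = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Gives the block's physical memory back to the OS while keeping the
 * address range mapped. Returns 0 on success, -1 on failure. */
int block_release_physical(void *block) {
    return madvise(block, BLOCK_SIZE, MADV_DONTNEED);
}
```

After block_release_physical succeeds, a stale pointer field read
from the block yields zero on Linux, so dereferencing it faults
instead of reaching attacker data.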

Cling does not suffer from fragmentation as the naive 
scheme described in Section 2.2, because it allows imme- 
diate reuse of small allocations’ memory within a pool. 
Address space consumption is also more reasonable: it 
is proportional to the number of allocation sites in the 
program, so it does not leak over time as in the naive 
solution, and is easily manageable in modern 64-bit ma- 
chines. 


3.3 Heap Organization 


Cling’s main heap is divided into blocks of 16K bytes. 
As illustrated in Figure 5, a smaller address range, 
dubbed the meta-heap, is reserved for holding block de- 
scriptors, one for each 16K address space block. Block 
descriptors contain fields for maintaining free lists of 
block ranges, storing the size of the block range, asso- 
ciating the block with a pool, and pointers to metadata 
for blocks holding small allocations. Metadata for block 
ranges are only set for the first block in the range—the 
head block. When address space is exhausted and the 
heap is grown, the meta-heap is grown correspondingly. 
The purpose of this meta-heap is to keep heap metadata 
separate, allowing reuse of the heap’s physical memory 
previously holding allocated data without discarding its 
metadata stored in the meta-heap. 

While memory in the heap area can be relinquished 
using madvise, metadata about address space must be 
kept in the meta-heap area, thus contributing to the mem- 
ory overhead of the scheme. This overhead is small. A 
block descriptor can be under 32 bytes in the current im- 
plementation, and with a block size of 16K, this corre- 
sponds to memory overhead less than 0.2% of the ad- 
dress space used, which is small enough for the address 
space usage observed in our evaluation. Moreover, a 
hashtable could be employed to further reduce this over- 
head if necessary. 

Both blocks and block descriptors are arranged in 
corresponding linear arrays, as illustrated in Figure 5, 
so Cling can map between address space blocks and 
their corresponding block descriptors using operations 
on their addresses. This allows Cling to efficiently re- 
cover the appropriate block descriptor when deallocating 
memory. 
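
A minimal sketch of this address arithmetic follows. The descriptor layout, the struct heap type, and the function names are hypothetical; the text only states that a descriptor can be under 32 bytes and that the two arrays correspond index by index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SHIFT 14 /* 16 KiB blocks */
#define BLOCK_SIZE ((uintptr_t)1 << BLOCK_SHIFT)

/* Hypothetical descriptor layout, chosen to stay under 32 bytes. */
struct block_desc {
    struct block_desc *next_free; /* free-list link for block ranges */
    uint32_t range_blocks;        /* range size, set on the head block */
    uint32_t pool_id;             /* pool that owns the block */
    void *bucket_meta;            /* metadata for small-allocation blocks */
};

struct heap {
    uintptr_t base;           /* address of block 0, block aligned */
    size_t nblocks;           /* number of blocks in the heap */
    struct block_desc *descs; /* meta-heap: descs[i] describes block i */
};

/* Index of the block containing addr. */
static size_t block_index(const struct heap *h, const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    assert(a >= h->base && a - h->base < (uintptr_t)h->nblocks * BLOCK_SIZE);
    return (size_t)((a - h->base) >> BLOCK_SHIFT);
}

/* Descriptor for the block containing addr, e.g. on deallocation. */
static struct block_desc *desc_of(const struct heap *h, const void *addr)
{
    return &h->descs[block_index(h, addr)];
}

/* Start address of the block a descriptor describes. */
static void *block_of(const struct heap *h, const struct block_desc *d)
{
    return (void *)(h->base + ((uintptr_t)(d - h->descs) << BLOCK_SHIFT));
}
```

With 32-byte descriptors and 16K blocks the metadata cost is 32/16384, about 0.195% of the address space, consistent with the bound quoted above.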

[Figure 5 diagram: the heap as an array of 16 KiB blocks (resident blocks are backed by physical memory) next to the meta-heap of block descriptors, which are never scrapped.] 

Figure 5: Heap comprised of blocks and meta-heap of 
block descriptors. The physical memory of deallocated 
blocks can be scrapped and reused to back blocks in other 
pools. 


Cling pools allocations based on their allocation 
site. To achieve this, Cling’s public memory al- 
location routines (e.g. malloc and new) retrieve 
their call site using the return address saved on the 
stack. Since Cling’s routines have complete con- 
trol over their prologues, the return address can al- 
ways be retrieved reliably and efficiently (e.g. using the 
__builtin_return_address GCC primitive). At 
first, this return address is used to distinguish between 
memory allocation sites. Section 3.5 describes how to 
discover and unwind simple allocation routine wrappers 
in the program, which is necessary for obtaining a mean- 
ingful allocation site in those cases. 
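
As a rough illustration of retrieving a call site, the sketch below uses the GCC primitive named above. The function current_call_site is a hypothetical stand-in for an allocation entry point, and the noinline attribute and volatile counter are assumptions added so that each call remains a separate call instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Side effect that keeps the compiler from merging or eliding calls. */
static volatile unsigned long calls_seen;

/* Hypothetical stand-in for an allocation routine: returns the address
 * at which its caller resumes, which identifies the call site. */
__attribute__((noinline)) static void *current_call_site(void)
{
    void *site = __builtin_return_address(0);
    calls_seen++;
    assert(site != NULL);
    return site;
}
```

Two textually distinct calls to current_call_site yield different addresses, which is the property pooling by allocation site relies on.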

Cling uses a hashtable to map allocation sites to pools 
at runtime. An alternative design to avoid hash table 
lookups could be to generate a trampoline for each call 
site and rewrite the call site at hand to use its dedicated 
trampoline instead of directly calling the memory allo- 
cation routine. The trampoline could then call a version 
of the memory allocation routine accepting an explicit 
pool parameter. The hash table, however, was preferred 
because it is less intrusive and gracefully handles cor- 
ner cases including calling malloc through a function 
pointer. Moreover, since this hash table is accessed fre- 
quently but updated infrequently, optimizations such as 
constructing perfect hashes can be applied in the future, 
if necessary. 
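
The site-to-pool mapping can be sketched as a small open-addressing hash table. This is an illustrative assumption, not Cling's data structure: the table capacity, the multiplicative hash, linear probing, and all names are hypothetical, and the sketch does not resize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SITE_TABLE_SLOTS 1024u /* power of two; hypothetical capacity */

struct site_entry {
    uintptr_t site;   /* allocation site address; 0 marks an empty slot */
    uint32_t pool_id; /* pool assigned to this site */
};

struct site_table {
    struct site_entry slots[SITE_TABLE_SLOTS];
    uint32_t used;         /* number of occupied slots */
    uint32_t next_pool_id; /* id handed to the next new site */
};

/* Multiplicative hash of a site address into a slot index. */
static uint32_t hash_site(uintptr_t site)
{
    uint64_t x = (uint64_t)site * 0x9E3779B97F4A7C15ull;
    return (uint32_t)(x >> 40) & (SITE_TABLE_SLOTS - 1);
}

/* Return the pool for a site, creating a new pool on first sight. */
static uint32_t pool_for_site(struct site_table *t, uintptr_t site)
{
    assert(site != 0);
    uint32_t i = hash_site(site);
    for (;;) {
        struct site_entry *e = &t->slots[i];
        if (e->site == site)
            return e->pool_id;
        if (e->site == 0) {
            assert(t->used < SITE_TABLE_SLOTS - 1); /* sketch never grows */
            e->site = site;
            e->pool_id = t->next_pool_id++;
            t->used++;
            return e->pool_id;
        }
        i = (i + 1) & (SITE_TABLE_SLOTS - 1); /* linear probing */
    }
}
```

Lookups dominate insertions in this access pattern, which is why a read-optimized layout such as a perfect hash is a plausible later refinement.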

Pools are organized around pool descriptors. The rel- 
evant data structures are illustrated in Figure 6. Each 
pool descriptor contains a table with free lists for block 
ranges. Each free list links together the head blocks of 
block ranges belonging to the same size-class (a power of 
two). These are blocks of memory that have been deal- 
located and are now reusable only within the pool. Pool 
descriptors also contain lists of blocks holding small al- 
locations, called buckets. Section 3.4 discusses small ob- 
ject allocation in detail. 

Initially, memory is not assigned to any pool. Larger 
allocations are directly satisfied using a power-of-two 
range of 16K blocks. A suitable free range is reused from 
the pool if possible; otherwise, a block range is allocated 
by incrementing a pointer towards the end of the heap, 
and it is assigned to the pool. If necessary, the heap is 
grown using a system call. When these large allocations 
are deallocated, they are inserted into the appropriate pool 
descriptor’s table of free lists according to their size. The 
free list pointers are embedded in block descriptors, al- 
lowing the underlying physical memory for the block to 
be relinquished using madvise. 


3.4 Small Allocations 


Allocations less than 8K in size (half the block size) are 
stored in slots inside blocks called buckets. Pool de- 
scriptors point to a table with entries to manage buck- 
ets for allocations belonging to the same size class. Size 
classes start from a minimum of 16 bytes, increase by 16 
bytes up to 128 bytes, and then increase exponentially 
up to the maximum of 8K, with 4 additional classes in 
between each pair of powers-of-two. Each bucket is as- 
sociated with a free slot bitmap, its element size, and a 
bump pointer used for fast allocation when the block is 
first used, as described next. 
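
One way to compute these size classes is sketched below. It assumes that the classes between consecutive powers of two are spaced at a quarter of the lower power (for example 160, 192, 224, and 256 after 128); the text does not spell out the spacing, so this reading and the function name size_class are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_CLASS 16u     /* smallest size class, and the linear step */
#define LINEAR_LIMIT 128u /* classes grow by 16 bytes up to here */
#define MAX_SMALL 8192u   /* largest small-allocation class */

/* Round a request of n bytes up to its size class. */
static size_t size_class(size_t n)
{
    assert(n > 0 && n <= MAX_SMALL);
    if (n <= LINEAR_LIMIT)
        return (n + MIN_CLASS - 1) / MIN_CLASS * MIN_CLASS;

    size_t base = LINEAR_LIMIT; /* largest power of two below n */
    while (base * 2 < n)
        base *= 2;
    size_t step = base / 4;     /* assumed: four classes per doubling */
    return base + (n - base + step - 1) / step * step;
}
```

Under this reading, rounding wastes less than a quarter of the requested size for every request above 128 bytes.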

Using bitmaps for small allocations seems to be a 
design requirement for keeping memory overhead low 
without reusing free memory for allocator metadata, so 
it is critical to ensure that bitmaps are efficient com- 
pared to free-list based implementations. Some effort 
has been put into making sure Cling uses bitmaps ef- 
ficiently. Cling borrows ideas from reaps [5] to avoid 
bitmap scanning when many objects are allocated from 
an allocation site in bursts. This case degenerates to just 
bumping a pointer to allocate consecutive memory slots. 
All empty buckets are initially used in bump mode, and 
stay in that mode until the bump pointer reaches the end 
of the bucket. Memory released while in bump mode is 
marked in the bucket’s bitmap but is not used for satis- 
fying allocation requests while the bump pointer can be 
used. 

Figure 6: Pool organization illustrating free lists of blocks available for reuse within the pool and the global hot bucket queue 
that delays reclamation of empty bucket memory. Linked list pointers are not stored inside blocks, as implied by the figure, but 
rather in their block descriptors stored in the meta-heap. Blocks shaded light gray have had their physical memory reclaimed. 


A pool has at most one bucket in bump mode per size 
class, pointed to by a field of the corresponding table entry, 
as illustrated in Figure 6. Cling first attempts to satisfy an 
allocation request using that bucket, if available. Buck- 
ets maintain the number of freed elements in a counter. A 
bucket whose bump pointer reaches the end of the bucket 
is unlinked from the table entry and, if the counter in- 
dicates it has free slots, inserted into a list of non-full 
buckets. If no bucket in bump mode is available, Cling 
attempts to use the first bucket from this list, scanning 
its bitmap to find a free slot. If the counter indicates the 
bucket is full after an allocation request, the bucket is 
unlinked from the list of non-full buckets, to avoid ob- 
structing allocations. 


Conversely, if the counter of free elements is zero prior 
to a deallocation, the bucket is re-inserted into the list of 
non-full buckets. If the counter indicates that the bucket 
is completely empty after deallocation, it is inserted into a 
list of empty buckets queuing for memory reuse. This ap- 
plies even for buckets in bump mode (and was important 
for keeping memory overhead low). This list of empty 
buckets is consulted on allocation if there is neither a 
bucket in bump mode, nor a non-full bucket. If this list is 
also empty, a new bucket is created using fresh address 
space, and initialized in bump mode. 
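
The behavior of a single bucket can be sketched as follows. This is a simplified illustration under stated assumptions: the struct layout and function names are hypothetical, the bitmap is embedded in the bucket structure for brevity, and the per-pool lists of non-full and empty buckets are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16384u
#define MAX_SLOTS (BLOCK_SIZE / 16u) /* smallest size class is 16 bytes */

struct bucket {
    unsigned char *mem; /* the 16 KiB block holding the slots */
    uint32_t slot_size; /* element size for this bucket */
    uint32_t nslots;    /* number of slots in the block */
    uint32_t bump;      /* next never-used slot; nslots once bump mode ends */
    uint32_t nfree;     /* number of freed slots recorded in free_map */
    uint8_t free_map[MAX_SLOTS / 8]; /* bit set means the slot was freed */
};

static void bucket_init(struct bucket *b, void *block, uint32_t slot_size)
{
    assert(slot_size >= 16 && slot_size <= BLOCK_SIZE / 2);
    memset(b, 0, sizeof *b);
    b->mem = block;
    b->slot_size = slot_size;
    b->nslots = BLOCK_SIZE / slot_size;
}

/* Allocate one slot, or return NULL if the bucket is full. */
static void *bucket_alloc(struct bucket *b)
{
    if (b->bump < b->nslots) /* bump mode: no bitmap scan needed */
        return b->mem + (size_t)b->bump++ * b->slot_size;
    if (b->nfree == 0)
        return NULL;
    for (uint32_t i = 0; i < b->nslots; i++) {
        if (b->free_map[i / 8] & (1u << (i % 8))) {
            b->free_map[i / 8] &= (uint8_t)~(1u << (i % 8));
            b->nfree--;
            return b->mem + (size_t)i * b->slot_size;
        }
    }
    return NULL; /* not reached while nfree matches the bitmap */
}

/* Record a freed slot; it is reused only after bump mode ends. */
static void bucket_free(struct bucket *b, void *p)
{
    size_t off = (size_t)((unsigned char *)p - b->mem);
    assert(off < BLOCK_SIZE && off % b->slot_size == 0);
    uint32_t i = (uint32_t)(off / b->slot_size);
    assert(!(b->free_map[i / 8] & (1u << (i % 8)))); /* catches double free */
    b->free_map[i / 8] |= (uint8_t)(1u << (i % 8));
    b->nfree++;
}
```

A fuller version would compare nfree against nslots after each call to decide when to move the bucket between the non-full and empty lists described above.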


Empty buckets are inserted into a global queue of hot 
buckets, shown at the bottom of Figure 6. This queue has 
a configurable maximum size (10% of non-empty buck- 
ets worked well in our experiments). When the queue 
size threshold is reached after inserting an empty bucket 
to the head of the queue, a hot bucket is removed from 
the tail of the queue, and becomes cold: its bitmap is 
deallocated, and its associated 16K of memory reused 
via an madvise system call. If a cold bucket is en- 
countered when allocating from the empty bucket list of 
a pool, a new bitmap is allocated and initialized. The 
hot bucket queue is important for reducing the number 
of system calls by trading some memory overhead, con- 
trollable through the queue size threshold. 
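
The hot bucket queue can be approximated by a bounded FIFO, sketched below. The ring-buffer representation, the capacity constant, the integer bucket ids, and the function names are hypothetical; the sketch evicts the oldest entry just before an insertion that would exceed the threshold, and it omits removing a bucket from the queue when the bucket is reused.

```c
#include <assert.h>
#include <stddef.h>

#define HOT_QUEUE_CAP 64u /* hypothetical storage capacity */

struct hot_queue {
    int ids[HOT_QUEUE_CAP]; /* bucket ids in insertion order */
    size_t head;            /* index of the oldest entry */
    size_t count;           /* number of hot buckets queued */
    size_t threshold;       /* configurable maximum queue size */
};

static void hot_queue_init(struct hot_queue *q, size_t threshold)
{
    assert(threshold > 0 && threshold <= HOT_QUEUE_CAP);
    q->head = 0;
    q->count = 0;
    q->threshold = threshold;
}

/* Queue a newly emptied bucket. Returns the id of the bucket that
 * became cold and should have its memory relinquished, or -1 if none. */
static int hot_queue_insert(struct hot_queue *q, int bucket_id)
{
    int cold = -1;
    if (q->count == q->threshold) {
        cold = q->ids[q->head]; /* oldest hot bucket turns cold */
        q->head = (q->head + 1) % HOT_QUEUE_CAP;
        q->count--;
    }
    q->ids[(q->head + q->count) % HOT_QUEUE_CAP] = bucket_id;
    q->count++;
    return cold;
}
```

A larger threshold defers more madvise calls at the cost of keeping more empty buckets resident, which is the trade-off the text describes.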


3.5 Unwinding Malloc Wrappers 


Wrappers around memory allocation routines may con- 
ceal real allocation sites. Many programs wrap malloc 
simply to check its return value or collect statistics. Such 
programs could be ported to Cling by making sure that 
the few such wrappers call macro versions of Cling’s al- 
location routines that capture the real allocation site, i.e. 
the wrapper’s call site. That is not necessary, however, 
because Cling can detect and handle many such wrap- 
pers automatically, and recover the real allocation site by 
unwinding the stack. This must be implemented care- 
fully because stack unwinding is normally intended for 
use in slow, error handling code paths. 

To detect simple allocation wrappers, Cling initiates 
a probing mechanism after observing a single allocation 
site requesting multiple allocation sizes. This probing 
first uses a costly but reliable unwind of the caller’s stack 
frame (using libunwind) to discover the stack loca- 
tion of the suspected wrapper function’s return address. 
Then, after saving the original value, Cling overwrites 
the wrapper’s return address on the stack with the ad- 
dress of a special assembler routine that will be inter- 
posed when the suspected wrapper returns. After Cling 
returns to the caller, and, in turn, the caller returns, the 
overwritten return address transfers control to the inter- 
posed routine. This routine compares the suspected al- 
location wrapper’s return value with the address of the 
memory allocated by Cling, also saved when the probe 
was initiated. If the caller appears to return the address 
just returned by Cling, it is assumed to be a simple wrap- 
per around an allocation function. 

To simplify the implementation, probing is aborted if 
the potential wrapper function issues additional alloca- 
tion requests before returning. This is not a problem in 
practice, because simple malloc wrappers usually per- 
form a single allocation. Moreover, a more thorough im- 
plementation can easily address this. 

The probing mechanism is only initiated when multi- 
ple allocation sizes are requested from a single alloca- 
tion site, potentially delaying wrapper identification. It 
is unlikely, however, that an attacker could exploit this 
window of opportunity in large programs. Furthermore, 
this rule helps prevent misidentifying typical functions 
encapsulating the allocation and initialization of objects 
of a single type, because these request objects of a sin- 
gle size. Sometimes, such functions allocate arrays of 
various sizes, and can be misidentified. Nevertheless, 
these false positives are harmless for security; they only 
introduce more pools that affect performance by over- 
constraining allocation, and the performance impact in 
our benchmarks was small. 

Similarly, the current implementation identifies func- 
tions such as strdup as allocation wrappers. While we 
could safely pool their allocations (they are of the same 
type), the performance impact in our benchmarks was 
again small, so we do not handle them in any special 
way. 

While this probing mechanism works well for the com- 
mon case of malloc wrappers that return the allocated 
memory through their function return value, it would not 
detect a wrapper that uses some other mechanism to re- 
turn the memory, such as modifying a pointer argument 
passed to the wrapper by reference. Fortunately, such 
malloc wrappers are unusual. 

Allocation sites identified as potential wrappers 
through this probing mechanism are marked as such in 
the hashtable mapping allocation site addresses to their 
pools, so Cling can unwind one more stack level to get 
the real allocation site whenever allocation requests from 
such an allocation site are encountered, and associate it 
with a distinct pool. 

Stack unwinding is platform specific and, in general, 
expensive. In 32-bit x86 systems, the frame pointer reg- 
ister ebp links stack frames together, making unwind- 
ing reasonably fast, but this register may be re-purposed 
in optimized builds. Heuristics can still be used with 
optimized code, e.g. looking for a value in the stack 
that points into the text segment, but they are slower. 
Data-driven stack unwinding on 64-bit AMD64 systems 
is more reliable but, again, expensive. Cling uses the 
libunwind library to encapsulate platform specific de- 
tails of stack unwinding, but caches the stack offset of 
wrappers’ return addresses to allow fast unwinding when 
possible, as described next, and gives up unwinding if 
not. 

Care must be taken when using a cached stack offset to 
retrieve the real allocation site, because the cached value 
may become invalid for functions with a variable frame 
size, e.g. those using alloca, resulting in the retrieval 
of a bogus address. To guard against this, whenever a 
new allocation site is encountered that was retrieved us- 
ing a cached stack offset, a slow but reliable unwind (us- 
ing libunwind) is performed to confirm the allocation 
site’s validity. If the check fails, the wrapper must have 
a variable frame size, and Cling falls back to allocating 
all memory requested through that wrapper from a single 
pool. In practice, typical malloc wrappers are simple 
functions with constant frame sizes. 

19th USENIX Security Symposium 


3.6 Limitations 


Cling prevents vtable hijacking, the standard exploita- 
tion technique for use-after-free vulnerabilities, and its 
constraints on function and data pointers are likely to 
prevent their exploitation, but it may not be able to pre- 
vent use-after-free attacks targeting data such as creden- 
tials and access control lists stored in objects of a single 
type. For example, a dangling pointer that used to point 
to the credentials of one user may end up pointing to the 
credentials of another user. 

Another theoretical attack may involve data structure 
inconsistencies, when accessed through dangling point- 
ers. For example, if a buffer and a variable holding its 
length are in separate objects, and one of them is read 
through a dangling pointer accessing an unrelated object, 
the length variable may be inconsistent with the actual 
buffer length, allowing dangerous bound violations. In- 
terestingly, this can be detected if Cling is used in con- 
junction with a defense offering spatial protection. 

Cling relies on mapping allocation sites to object 
types. A program with contrived flow of control, how- 
ever, such as in the following example, would obscure 
the type of allocation requests: 


    int size = condition ? sizeof(struct A) : sizeof(struct B); 
    void *obj = malloc(size); 


Fortunately, this situation is less likely when allocating 
memory using the C++ operator new that requires a type 
argument. 

A similar problem occurs when the allocated object is 
a union: objects allocated at the same program location 
may still have different types of data at the same offset. 

Tail-call optimizations can also obscure allocation 
sites. Tail-call optimization is applicable when the call to 
malloc is the last instruction before a function returns. 
The compiler can then replace the call instruction with 
a simple control-flow transfer to the allocation routine, 
avoiding pushing a return address to the stack. In this 
case, Cling would retrieve the return address of the func- 
tion calling malloc. Fortunately, in most cases where 
this situation might appear, using the available return ad- 
dress still identifies the allocation site uniquely. 

Cling cannot prevent unsafe reuse of stack allocated 
objects, for example when a function erroneously returns 
a pointer to a local variable. This could be addressed by 
using Cling as part of a compiler-based solution, by mov- 
ing dangerous (e.g. address taken) stack based variables 
to the heap at compile time. 




Custom memory allocators are a big concern. They al- 
locate memory in huge chunks from the system allocator, 
and chop them up to satisfy allocation requests for indi- 
vidual objects, concealing the real allocation sites of the 
program. Fortunately, many custom allocators are used 
for performance when allocating many objects of a sin- 
gle type. Thus, pooling such custom allocator’s requests 
to the system allocator, as done for any other allocation 
site, is sufficient to maintain type-safe memory reuse. It 
is also worth pointing out that roll-your-own general purpose 
memory allocators have become a serious security liabil- 
ity due to a number of exploitable memory management 
bugs beyond use-after-free (invalid frees, double frees, 
and heap metadata corruption in general). Therefore, us- 
ing a custom allocator in new projects is not a decision 
to be taken lightly. 

Usability in 32-bit platforms with scarce address space 
is limited. This is less of a concern for high-end and fu- 
ture machines. If necessary, however, Cling can be com- 
bined with a simple conservative collector that scans all 
words in used physical memory blocks for pointers to 
used address space blocks. This solution avoids some 
performance and compatibility problems of conservative 
garbage collection by relying on information about ex- 
plicit deallocations. Once address space is exhausted, 
only memory that is in use needs to be scanned and 
any 16K block of freed memory that is not pointed to by 
any word in the scanned memory can be reused. The 
chief compatibility problem of conservative garbage col- 
lection, namely hidden pointers (manufactured pointers 
invisible to the collector), cannot cause premature deal- 
locations, because only explicitly deallocated memory 
would be garbage collected in this scheme. Neverthe- 
less, relying instead on the abundant address space of 
modern machines is more attractive, because garbage 
collection may introduce unpredictability or expose the 
program to attacks using hidden dangling pointers. 


3.7 Implementation 


Cling comes as a shared library providing implementa- 
tions for the malloc and the C++ operator new alloca- 
tion interfaces. It can be preloaded with platform specific 
mechanisms (e.g. the LD_PRELOAD environment vari- 
able on most Unix-based systems) to override the sys- 
tem’s memory allocation routines at program load time. 


4 Experimental Evaluation 


4.1 Methodology 


We measured Cling’s CPU, physical memory, and vir- 
tual address space overheads relative to the default GNU 
libc memory allocator on a 2.66GHz Intel Core 2 Q9400 
CPU with 4GB of RAM, running x86_64 GNU/Linux 
with a version 2.6 Linux kernel. We also measured two 
variations of Cling: without wrapper unwinding and us- 
ing a single pool. 

Figure 7: Cumulative distribution function of memory 
allocation sizes for gzip, vpr, gcc, parser, and 
equake. 

Figure 8: Cumulative distribution function of mem- 
ory allocation sizes for perlbmk, vortex, twolf, 
espresso, and gobmk. 

We used benchmarks from the SPEC CPU 2000 
and (when not already included in CPU 2000) 2006 
benchmark suites [22]. Programs with few alloca- 
tions and deallocations have practically no overhead 
with Cling, thus we present results for SPEC bench- 
marks with at least 100,000 allocation requests. We also 
used espresso, an allocation intensive program that 
is widely used in memory management studies, and is 
useful when comparing against related work. Finally, 
in addition to CPU bound benchmarks, we also evalu- 
ated Cling with a current version of the Mozilla Firefox 
web browser. Web browsers like Firefox are typical at- 
tack targets for use-after-free exploits via malicious web 
sites; moreover, unlike many benchmarks, Firefox is an 
application of realistic size and running time. 

Figure 9: Cumulative distribution function of mem- 
ory allocation sizes for hmmer, h264ref, omnetpp, 
astar, and dealII. 

Figure 10: Cumulative distribution function of mem- 
ory allocation sizes for sphinx3, soplex, povray, 
xalancbmk, and Firefox. 

Some programs use custom allocators, defeating 
Cling’s protection and masking its overhead. For these 
experiments, we disabled a custom allocator implemen- 
tation in parser. The gcc benchmark also uses a 
custom allocation scheme (obstack) with different 
semantics from malloc that cannot be readily dis- 
abled. We include it to contrast its allocation size 
distribution with those of other benchmarks. Recent 
versions of Firefox also use a custom allocator [10] 
that was disabled by compiling from source with the 
--disable-jemalloc configuration option. 

The SPEC programs come with prescribed input data. 
For espresso, we generated a uniformly random input 
file with 15 inputs and 15 outputs, totalling 32K lines. 
For Firefox, we used a list of 200 websites retrieved from 
our browsing history, and replayed it using the -remote 
option to direct a continuously running Firefox instance 
under measurement to a new web site every 10 seconds. 

Orig. (MiB) 

Benchmark Orig: (See) 

gzip 
vpr 
gcc 
parser 
equake 
perlbmk 
vortex 
twolf 
gobmk 628.6 
hmmer 542.15 
dealll 476.74 
sphinx3 1143.6 
h264ref 934.71 
omnetpp 57357 
soplex 524.01 
povray 272.54 
astar 656.09 
xalancbmk 421.03 
44.69 
40.31 
809.46 | 1.70 
59.93 | 1.37 
80.18 | 1.52 
183.45 | 1.03 
639.51 | 2.31 
34.1 | 0.77 
345.51 | 1.56 
1.14 436.54 | 1.45 

Cling Ratio 

espresso 49 i. 14 3,877,784 77,711 3,877,783 77,711 
firefox 2101 51 595 22,579,058 | 464,565 22,255,963 464536 

Table 1: Memory allocation sites and requests in benchmarks and Firefox browser. 

We report memory consumption using information ob- 
tained through the /proc/self/status Linux inter- 
face. When reporting physical memory consumption, the 
sum of the VmRSS and VmPTE fields is used. The lat- 
ter measures the size of the page tables used by the pro- 
cess, which increases with Cling due to the larger address 
space. In most cases, however, it was still very small in 
absolute value. The VmSize field is used to measure 
address space size. The VmPeak and VmHWM fields are 
used to obtain peak values for the VmSize and VmRSS 
fields respectively. 
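
The measurement described above can be sketched as a small parser over the text of /proc/self/status. The function names are hypothetical, and the sketch assumes the usual "Field:<whitespace>value kB" line format; reading the file itself (for example with fopen and fread) is left out so that the parsing logic can be shown on a fixed string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return the value in kB of a field such as "VmRSS" from the text of
 * /proc/self/status, or -1 if the field is absent or malformed. */
static long status_field_kb(const char *text, const char *field)
{
    assert(text != NULL && field != NULL);
    size_t flen = strlen(field);
    const char *line = text;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            long kb;
            if (sscanf(line + flen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n'); /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Physical memory as defined in the text: VmRSS plus VmPTE. */
static long physical_kb(const char *text)
{
    long rss = status_field_kb(text, "VmRSS");
    long pte = status_field_kb(text, "VmPTE");
    return (rss < 0 || pte < 0) ? -1 : rss + pte;
}
```

Peak values would be obtained the same way by querying the VmPeak and VmHWM fields.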

The reported CPU times are averages over three runs 
with small variance. CPU times are not reported for Fire- 
fox, because the experiment was IO bound with signifi- 
cant variance. 


4.2 Benchmark Characterization 


Figures 7–10 illustrate the size distribution of alloca- 
tion requests made by each benchmark running with 
its respective input data. We observe that most bench- 
marks request a wide range of allocation sizes, but the 
gcc benchmark that uses a custom allocator mostly re- 
quests memory in chunks of 4K. 

Table 1 provides information on the number of static 
allocation sites in the benchmarks and the absolute num- 
ber of allocation and deallocation requests at runtime. 
For allocation sites, the first column is the number of al- 
location sites that are not wrappers, the second column is 
the number of allocation sites that are presumed to be in 
allocation routine wrappers (such as safe_malloc in 
twolf, my_malloc in vpr, and xmalloc in gcc), 
and the third column is the number of call sites of these 
wrappers, that have to be unwound. We observe that 
Firefox has an order of magnitude more allocation sites 
than the rest. 

The number of allocation and deallocation requests for 
small (less than 8K) and large allocations are reported 
separately. The vast majority of allocation requests are 
for small objects and thus the performance of the bucket 
allocation scheme is crucial. In fact, no attempt was 
made to optimize large allocations in this work. 


4.3 Results 


Table 2 tabulates the results of our performance measure- 
ments. We observe that the runtime overhead is modest 
even for programs with a higher rate of allocation and 
deallocation requests. With the exception of espresso 
(16%), parser (12%), and dealII (8%), the over- 
head is less than 2%. Many other benchmarks with few 
allocation and deallocation requests, not presented here, 
have even less overhead—an interesting benefit of this 
approach, which, unlike solutions interposing on mem- 
ory accesses, does not tax programs not making heavy 
use of dynamic memory. 

In fact, many benchmarks with a significant num- 
ber of allocations run faster with Cling. For example 
xalancbmk, a notorious allocator abuser, runs 25% 
faster. In many cases we observed that by tuning allo- 
cator parameters such as the block size and the length of 
the hot bucket queue, we were able to trade memory for 
speed and vice versa. In particular, with different block 
sizes, xalancbmk would run twice as fast, but with a 
memory overhead around 40%. 

| Other 
2921 1.16 1.07 1.10 4.63 | 1.13 1.06 1.02 19.36 | 2.08 

Table 2: Experimental evaluation results for the benchmarks. 

In order to factor out the effects of allocator design 
and tuning as much as possible, Table 2 also includes 
columns for CPU and memory overhead using Cling 
with a single pool (which implies no unwinding over- 
head as well). We observe that in some cases Cling 
with a single pool is faster and uses less memory than 
the system allocator, hiding the non-zero overheads of 
pooling allocations in the full version of Cling. On the 
other hand, for some benchmarks with higher overhead, 
such as dealII and parser, some of the overhead re- 
mains even without using pools. For these cases, both 
slow and fast, it makes sense to compare the overhead 
against Cling with a single pool. A few programs, how- 
ever, like xalancbmk, use more memory or run slower 
with a single pool. As mentioned earlier, this benchmark 
is quite sensitive to allocator tweaks. 

Table 2 also includes columns for CPU and memory 
overhead using Cling with many pools but without un- 
winding wrappers. We observe that for espresso and 
parser, some of the runtime overhead is due to this 
unwinding. 

Peak memory consumption was also low for most 
benchmarks, except for parser (14%), soplex 
(27%), povray (33%), and espresso (13%). Inter- 
estingly, for soplex and povray, this overhead is not 
because of allocation pooling: these benchmarks incur 
similar memory overheads when running with a single 
pool. In the case of soplex, we were able to deter- 
mine that the overhead is due to a few large realloc 
requests, whose current implementation in Cling is sub- 
optimal. The allocation intensive benchmarks parser 
and espresso, on the other hand, do appear to incur 
memory overhead due to pooling allocations. Disabling 
unwinding also affects memory use by reducing the num- 
ber of pools. 

The last two columns of Table 2 report virtual ad- 
dress space usage. We observe that Cling’s address 
space usage is well within the capabilities of modern 64- 
bit machines, with the worst increase less than 150%. 
Although 64-bit architectures can support much larger 
address spaces, excessive address space usage would 
cost in page table memory. Interestingly, in all cases, 
the address space increase did not prohibit running the 
programs on 32-bit machines. Admittedly, however, it 
would be pushing up against the limits. 

In the final set of experiments, we ran Cling with Fire- 
fox. Since, due to the size of the program, this is the 
most interesting experiment, we provide a detailed plot 
of memory usage as a function of time (measured in al- 
located Megabytes of memory), and we also compare 
against the naive solution of Section 2.2. 

The naive solution was implemented by preventing 
Cling from reusing memory and changing the memory 
block size to 4K, which is optimal in terms of mem- 
ory reuse. (It does increase the system call rate, how- 
ever.) The naive solution could be further optimized by 
not using segregated storage classes, but this would not 
affect the memory usage significantly, as the overhead 
of rounding small allocation requests to size classes in 
Cling is at most 25%, and much less in practice. 

[Figure 11 plot: memory usage (MiB) versus requested memory (MiB, 0 to 16000) for Cling, System, Cling (1 Pool), and Naive.] 

Figure 11: Firefox memory usage over time (measured in 
requested memory). 


Figure 11 graphs memory use for Firefox. We ob- 
serve that Cling (with pools) uses similar memory to the 
system’s default allocator. Using pools does incur some 
overhead, however, as we can see by comparing against 
Cling using a single pool (which is more memory effi- 
cient than the default allocator). Even after considering 
this, Cling’s approach of safe address space reuse ap- 
pears usable with large, real applications. We observe 
that Cling’s memory usage fluctuates more than the de- 
fault allocator’s because it aggressively returns memory 
to the operating system. These graphs also show that the 
naive solution has excessive memory overhead. 


Finally, Figure 12 graphs address space usage for Fire- 
fox. It illustrates the importance of returning memory 
to the operating system; without doing so, the scheme’s 
memory overhead would be equal to its address space 
use. We observe that this implied memory usage with 
Firefox may not be prohibitively large, but many of the 
benchmarks evaluated earlier show that there are cases 
where it can be excessive. As for the address space usage 
of the naive solution, it quickly goes off the chart because 
it is linear with requested memory. The naive solution 
was also the only case where the page table overhead 
had a significant contribution during our evaluation: in 
this experiment, the system allocator used 0.99 MiB in 
page tables, Cling used 1.48 MiB, and the naive solu- 
tion 19.43 MiB. 

Figure 12: Firefox address space usage over time (mea- 
sured in requested memory). 


5 Related Work 


Programs written in high-level languages using garbage 
collection are safe from use-after-free vulnerabilities, be- 
cause the garbage collector never reuses memory while 
there is a pointer to it. Garbage collecting unsafe lan- 
guages like C and C++ is more challenging. Neverthe- 
less, conservative garbage collection [6] is possible, and 
can address use-after-free vulnerabilities. Conservative 
garbage collection, however, has unpredictable runtime 
and memory overheads that may hinder adoption, and is 
not entirely transparent to the programmer: some port- 
ing may be required to eliminate pointers hidden from 
the garbage collector. 


DieHard [4] and Archipelago [16] are memory alloca- 
tors designed to survive memory errors, including dan- 
gling pointer dereferences, with high probability. They 
can survive dangling pointer errors by preserving the 
contents of freed objects for a random period of time. 
Archipelago improves on DieHard by trading address 
space to decrease physical memory consumption. These 
solutions are similar to the naive solution of Section 2.2, 
but address some of its performance problems by even- 
tually reusing memory. Security, however, is compro- 
mised: while their probabilistic guarantees are suitable 
for addressing reliability, they are insufficient against at- 
tackers who can adapt their attacks. Moreover, these so- 
lutions have considerable runtime overhead for alloca- 
tion intensive applications. DieHard (without its replica- 
tion feature) has 12% average overhead but up to 48.8% 
for perlbmk and 109% for twolf. Archipelago has 
6% runtime overhead across a set of server applications 
with low allocation rates and few live objects, but the 
allocation intensive espresso benchmark runs 7.32 
times slower than using the GNU libc allocator. Cling 
offers deterministic protection against dangling pointers 
(but not spatial violations), with significantly lower
overhead (e.g. 16% runtime overhead for the allocation
intensive espresso benchmark) thanks to allowing
type-safe reuse within pools.

Dangling pointer accesses can be detected using 
compile-time instrumentation to interpose on every 
memory access [3, 24]. This approach guarantees com- 
plete temporal safety (sharing most of the cost with spa- 
tial safety), but has much higher overhead than Cling. 

Region-based memory management (e.g. [14]) is a 
language-based solution for safe and efficient memory 
management. Object allocations are maintained in a lex- 
ical stack, and are freed when the enclosing block goes 
out of scope. To prevent dangling pointers, objects can 
only refer to other objects in the same region or regions 
higher up the stack. It may still have to be combined with 
garbage collection to address long-lived regions. Its per- 
formance is better than using garbage collection alone, 
but it is not transparent to programmers. 

A program can be manually modified to use reference- 
counted smart pointers to prevent reusing memory of ob- 
jects with remaining references. This, however, requires 
major changes to application code. HeapSafe [12], on 
the other hand, is a solution that applies reference count- 
ing to legacy code automatically. It has reasonable over- 
head over a number of CPU bound benchmarks (geomet- 
ric mean of 11%), but requires recompilation and some 
source code tweaking. 

Debugging tools, such as Electric Fence, use a new 
virtual page for each allocation of the program and 
rely on page protection mechanisms to detect dangling 
pointer accesses. The physical memory overheads due 
to padding allocations to page boundaries make this
approach impractical for production use. Dhurjati et al. [8]
devised a mechanism to transform memory overhead to 
address space overhead by wrapping the memory allo- 
cator and returning a pointer to a dedicated new virtual 
page for each allocation but mapping it to the physical 
page used by the original allocator. The solution’s run- 
time overhead for Unix servers is less than 4%, and for 
other Unix utilities less than 15%, but incurs up to 11x 
slowdown for allocation intensive benchmarks. 

Interestingly, type-safe memory reuse (dubbed type- 
stable memory management [13]) was first used to sim- 
plify the implementation of non-blocking synchroniza- 
tion algorithms by preventing type errors during specu- 
lative execution. In that case, however, it was not applied 
indiscriminately, and memory could be safely reused af- 
ter some time bound; thus, performance issues addressed 
in this work were absent. 

Dynamic pool allocation based on allocation site in- 
formation retrieved by malloc through the call stack 
has been used for dynamic memory optimization [25]. 
That work aimed to improve performance by laying out 
objects allocated from the same allocation site
consecutively in memory, in combination with data prefetching
instructions inserted into binary code. 

Dhurjati et al. [9] introduced type-homogeneity as a 
weaker form of temporal memory safety. Their solution 
uses automatic pool allocation at compile-time to seg- 
regate objects into pools of the same type, only reusing 
memory within pools. Their approach is transparent to 
the programmer and preserves address space, but relies 
on imprecise, whole-program analysis. 

WIT [2] enforces an approximation of memory safety. 
It thwarts some dangling pointer attacks by constraining 
writes and calls through hijacked pointer fields in struc- 
tures accessed through dangling pointers. It has an aver- 
age runtime overhead of 10% for SPEC benchmarks, but 
relies on imprecise, whole-program analysis. 

Many previous systems only address the spatial di- 
mension of memory safety (e.g. bounds checking sys- 
tems like [15]). These can be complemented with Cling 
to address both spatial and temporal memory safety. 

Finally, address space layout randomization (ASLR) 
and data execution prevention (DEP) are widely used 
mechanisms designed to thwart exploitation of memory 
errors in general, including use-after-free vulnerabilities. 
These are practical defenses with low overhead, but they 
can be evaded. For example, a non-executable heap can 
be bypassed with so-called return-to-libc attacks [20]
diverting control-flow to legitimate executable code in 
the process image. ASLR can obscure the locations of 
such code, but relies on secret values, which a lucky or 
determined attacker might guess. Moreover, buffer over- 
reads [23] can be exploited to read parts of the memory 
contents of a process running a vulnerable application, 
breaking the secrecy assumptions of ASLR. 


6 Conclusions 


Pragmatic defenses against low-level memory corrup- 
tion attacks have gained considerable acceptance within 
the software industry. Techniques such as stack ca- 
naries, address space layout randomization, and safe ex- 
ception handling —thanks to their low overhead and 
transparency for the programmer— have been read- 
ily employed by software vendors. In particular, at- 
tacks corrupting metadata pointers used by the mem- 
ory management mechanisms, such as invalid frees, dou- 
ble frees, and heap metadata overwrites, have been ad- 
dressed with resilient memory allocator designs, benefit- 
ing many programs transparently. Similar in spirit, Cling 
is a pragmatic memory allocator modification for defend- 
ing against use-after-free vulnerabilities that is readily 
applicable to real programs and has low overhead. 

We found that many of Cling’s design requirements 
could be satisfied by combining mechanisms from
successful previous allocator designs, and are not inherently
detrimental for performance. The overhead of mapping
allocation sites to allocation pools was found acceptable
in practice, and could be further addressed in future
implementations. Finally, closer integration with the
language by using compile-time libraries is possible,
especially for C++, and can eliminate the semantic gap
between the language and the memory allocator by
forwarding type information to the allocator, increasing
security and flexibility in memory reuse. Nevertheless, the
current instantiation has the advantage of being readily
applicable to a problem with no practical solutions.
Abstract


In recent years, many advances have been made in 
cryptography, as well as in the performance of commu- 
nication networks and processors. As a result, many ad- 
vanced cryptographic protocols are now efficient enough 
to be considered practical, yet research in the area re- 
mains largely theoretical and little work has been done 
to use these protocols in practice, despite a wealth of po- 
tential applications. 

This paper introduces a simple description language, 
ZKPDL, and an interpreter for this language. ZKPDL 
implements non-interactive zero-knowledge proofs of 
knowledge, a primitive which has received much atten- 
tion in recent years. Using our language, a single pro- 
gram may specify the computation required by both the 
prover and verifier of a zero-knowledge protocol, while 
our interpreter performs a number of optimizations to 
lower both computational and space overhead. 

Our motivating application for ZKPDL has been the 
efficient implementation of electronic cash. As such, 
we have used our language to develop a cryptographic 
library, Cashlib, that provides an interface for using e- 
cash and fair exchange protocols without requiring ex- 
pert knowledge from the programmer. 


1 Introduction 


Modern cryptographic protocols are complicated, 
computationally intensive, and, given their security re- 
quirements, require great care to implement. However, 
one cannot expect all good cryptographers to be good 
programmers, or vice versa. As a result, many newly
proposed protocols, often described by their authors as
efficient enough for deployment, are left unimplemented,
despite the potentially useful primitives they offer to sys- 
tem designers. We believe that a lack of high-level soft- 
ware support (such as that provided by OpenSSL, which 
provides basic encryption and hashing) presents a barrier 
to the implementation and deployment of advanced
cryptographic protocols, and in this work we attempt to
remove this obstacle.

One particular area of recent cryptographic research 
which has applications for privacy-preserving systems is 
zero-knowledge proofs [46, 45, 16, 38], which provide 
a way of proving that a statement is true without re- 
vealing anything beyond the validity of the statement. 
Among the applications of zero-knowledge proofs are 
electronic voting [48, 55, 37, 50], anonymous authenti- 
cation [20, 35, 61], anonymous electronic ticketing for 
public transportation [49], verifiable outsourced compu- 
tation [8, 42], and essentially any system in which hon- 
esty needs to be enforced without sacrificing privacy. 
Much recent attention has been paid to protocols based 
on anonymous credentials [29, 34, 23, 25, 10, 7], which 
allow users to anonymously prove possession of a valid 
credential (e.g., a driver’s license), or prove relationships 
based on data associated with that credential (e.g., that a 
user’s age lies within a certain range) without revealing 
their identity or other data. These protocols also prevent 
the person verifying a credential and the credential’s is- 
suer from colluding to link activity to specific users. As 
corporations and governments move to put an increas- 
ing amount of personal information online, the need for 
efficient privacy-preserving systems has become increas- 
ingly important and a major focus of recent research. 

Another application of zero-knowledge proofs is elec- 
tronic cash. The primary aim of our work has been to 
enable the efficient deployment of secure, anonymous 
electronic cash (e-cash) in network applications. Like 
physical coins, e-coins cannot be forged; furthermore, 
given two e-coins it is impossible to tell who spent them, 
or even if they came from the same user. For this rea- 
son, e-cash holds promise for use in anonymous settings 
and privacy-preserving applications, where free-riding 
by users may threaten a system’s stability. 

Actions in any e-cash system can be characterized 
as in Figure 1. There are two centralized entities: the 
bank and the arbiter. The bank keeps track of users'
account balances, lets the users withdraw money, and
accepts coin deposits. The arbiter (a trusted third party)
resolves any disputes that arise between users in the course
of their fair exchanges. Once the users have obtained
money from the bank, they are free to exchange coins for
items (or just barter for items) and in this way create an
economy.

Figure 1: An overview of the entities involved in our e-cash
system. Users may engage in buy or barter transactions,
withdraw and deposit coins as necessary, and consult the
arbiter for resolution only in the case of a dispute.


In previous work [9] we describe a privacy-preserving 
P2P system based on BitTorrent that uses our e-cash 
and fair exchange protocols to incentivize users to share 
data. Here, the application of e-cash provides protection 
against selfish peers, as well as an incentive to upload for 
peers who have completed their download and thus have 
no need to continue participating. This system has been 
realized by our work on the Buy and Barter protocols, 
described in Section 6.2, which allow a user to fairly ex- 
change e-coins for blocks of data, or barter one block of 
data for another. 


These e-cash protocols can also be used for payments 
in other systems that face free-riding problems, such as 
anonymous onion routing [26]. In such a system, routers 
would be paid for forwarding messages using e-cash, 
thus providing incentives to route traffic on behalf of oth- 
ers in a manner similar to that proposed by Androulaki et 
al. [1]. Since P2P systems like these require each user to 
perform many cryptographic exchanges, the need to pro- 
vide high performance for repeated executions of these 
protocols is paramount. 


1.1 Our contribution 


In this paper, we hope to bridge the gap between de- 
sign and deployment by providing a language, ZKPDL 
(Zero-Knowledge Proof Description Language), that en- 
ables programmers and cryptographers to more easily 
implement privacy-preserving protocols. We also
provide a library, Cashlib, that builds upon our language to
provide simple access to cryptographic protocols such as 
electronic cash, blind signatures, verifiable encryption, 
and fair exchange. 


The design and implementation of our language and 
library were motivated by collaborations with systems 
researchers interested in employing e-cash in high- 
throughput applications, such as the P2P systems de- 
scribed earlier. The resulting performance concerns, and 
the complexity of the protocols required, motivated our 
library’s focus on performance and ease of use for both 
the cryptographers designing the protocols and the sys- 
tems programmers charged with putting them into prac- 
tice. These twin concerns led to our language-based ap- 
proach and work on the interpreter. 


The high-level nature of our language brings two ben- 
efits. First, it frees the programmer from having to worry 
about the implementation of cryptographic primitives, 
efficient mathematical operations, generating and pro- 
cessing messages, etc.; instead, ZKPDL allows the spec- 
ification of a protocol in a manner similar to that of theo- 
retical descriptions. Second, it allows our library to make 
performance optimizations based on analysis of the pro- 
tocol description itself. 


ZKPDL permits the specification of many widely- 
used zero-knowledge proofs. We also provide an in- 
terpreter that generates and verifies proofs for protocols 
described by our language. The interpreter performs 
optimizations such as precomputation of expected ex- 
ponentiations, translations to prevent redundant proofs, 
and caching compiled versions of programs to be loaded 
when they are used again on different inputs. More de- 
tails on these optimizations are provided in Section 4.2. 


Our e-cash library, Cashlib, described in Section 6, sits 
atop our language to provide simple access to higher- 
level cryptographic primitives such as e-cash [26], blind 
signatures [24], verifiable encryption [27], and optimistic 
fair exchange [9, 51]. Because of the modular nature of 
our language, we believe that the set of primitives pro- 
vided by our library can be easily extended to include 
other zero-knowledge protocols. 

Finally, we hope that our efforts will encourage pro- 
grammers to use (and extend) our library to implement 
their cryptographic protocols, and that our language will 
make their job easier; we welcome contribution by our 
fellow researchers in this effort. Documentation and 
source code for our library can be found online at
http://github.com/brownie/cashlib.


2 Cryptographic Background 


There are two main modern cryptographic primitives 
used in our framework: commitment schemes and
zero-knowledge proofs. Briefly, a commitment scheme can
be thought of as cryptographically analogous to an enve- 
lope. When a user Alice wants to commit to a value, she 
puts the value in the envelope and seals it. Upon receiv- 
ing a commitment, a second user Bob cannot tell which 
value is in the envelope; this property is called hiding (in 
this analogy, let’s assume Alice is the only one who can 
open the envelope). Furthermore, because the envelope 
is sealed, Alice cannot sneak another value into the
envelope without Bob knowing: this property is called
binding. To eventually reveal the value inside the envelope,
all Alice has to do is open it (cryptographically, she does
this by revealing the private value and any randomness
used to form the commitment; this collection of values is
aptly referred to as the opening of the commitment). We 
employ both Pedersen commitments [64] and Fujisaki- 
Okamoto commitments [41, 36], which rely on the secu- 
rity of the Discrete Log assumption and the Strong RSA 
assumption respectively. 
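
To make the envelope analogy concrete, the following is a minimal sketch of a Pedersen-style commitment c = g^x * h^r mod p. It is not taken from the library: the group (p = 23, with g = 4 and h = 9 generating the subgroup of order q = 11) and the names commit and verify_opening are assumptions chosen for illustration. In a group this small, discrete logarithms are trivial to compute, so the sketch shows only the arithmetic and offers no real hiding or binding.

```cpp
#include <cassert>
#include <cstdint>

// Toy parameters (illustrative assumptions, far too small to be secure):
// g and h both generate the subgroup of order Q = 11 in Z_23^*.
constexpr std::uint64_t P = 23, Q = 11, G = 4, H = 9;

// Square-and-multiply modular exponentiation.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t mod) {
    assert(mod > 1);
    std::uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

// Commit to value x using randomness r: c = g^x * h^r mod p.
std::uint64_t commit(std::uint64_t x, std::uint64_t r) {
    return pow_mod(G, x % Q, P) * pow_mod(H, r % Q, P) % P;
}

// Check a claimed opening (x, r) against a commitment c.
bool verify_opening(std::uint64_t c, std::uint64_t x, std::uint64_t r) {
    return commit(x, r) == c;
}
```

Revealing (x, r) opens the commitment: the receiver recomputes g^x * h^r and compares it with c. Committing to the same x under different randomness r yields different values of c, which is what hides x.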


Zero-knowledge proofs [46, 45] provide a way of 
proving that a statement is true to someone without that 
person learning anything beyond the validity of the state- 
ment. For example, if the statement were “I have access 
to this system” then the verifier would learn only that I
really do have access, and not, for example, how I gain 
access or what my access code is. In our library, we make 
use of sigma proofs [33], which are three-message proofs 
that achieve a weaker variant of zero-knowledge known 
as honest-verifier zero-knowledge. We do not implement 
sigma protocols directly; instead, we use the Fiat-Shamir 
heuristic [40] that transforms sigma protocols into non- 
interactive (fully) zero-knowledge proofs, secure in the 
random oracle model [12]. 
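
As an illustration of the Fiat-Shamir transformation, the sketch below turns the classic three-message Schnorr protocol (proving knowledge of x such that y = g^x) into a non-interactive proof by deriving the challenge from a hash of the transcript. It is a toy under stated assumptions rather than the library's code: the group parameters are tiny, the names prove, verify and challenge are hypothetical, the nonce k is passed in explicitly (it must be fresh and random in practice), and FNV-1a stands in for the random oracle even though it is not a cryptographic hash.

```cpp
#include <cassert>
#include <cstdint>

// Toy group (illustrative assumption): g generates the order-Q subgroup of Z_P^*.
constexpr std::uint64_t P = 23, Q = 11, G = 4;

// Square-and-multiply modular exponentiation.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t mod) {
    assert(mod > 1);
    std::uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

// Stand-in for the random oracle: FNV-1a over (g, y, t), reduced mod Q.
// NOT cryptographically secure; a real system would use a cryptographic hash.
std::uint64_t challenge(std::uint64_t y, std::uint64_t t) {
    const std::uint64_t transcript[] = {G, y, t};
    std::uint64_t h = 0xcbf29ce484222325ULL;        // FNV-1a offset basis
    for (std::uint64_t v : transcript) {
        for (int i = 0; i < 8; ++i) {
            h ^= (v >> (8 * i)) & 0xff;
            h *= 0x100000001b3ULL;                  // FNV-1a prime
        }
    }
    return h % Q;
}

struct Proof {
    std::uint64_t t;  // first sigma-protocol message, g^k
    std::uint64_t s;  // response, k + c*x mod Q
};

// Prover: knows x (with 0 <= x < Q) such that y = g^x; k is the one-time nonce.
Proof prove(std::uint64_t x, std::uint64_t k) {
    std::uint64_t y = pow_mod(G, x, P);
    std::uint64_t t = pow_mod(G, k, P);
    std::uint64_t c = challenge(y, t);  // the hash replaces the verifier's message
    std::uint64_t s = (k + c * x) % Q;
    return {t, s};
}

// Verifier: recompute the challenge and check g^s == t * y^c mod P.
bool verify(std::uint64_t y, const Proof& pf) {
    std::uint64_t c = challenge(y, pf.t);
    return pow_mod(G, pf.s, P) == pf.t * pow_mod(y, c, P) % P;
}
```

The check succeeds for an honest prover because g^s = g^(k + c*x) = t * y^c. The verifier sends no message at all, which is what makes the proof non-interactive.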


A primitive similar to zero-knowledge is the idea of a 
proof of knowledge [11], in which the prover not only 
proves that a statement is true, but also proves that it 
knows a reason why the statement is true. Extending 
the above example, this would be equivalent to proving 
the statement “I have access to the system, and I know a 
password that makes this true.” 


In addition to these cryptographic primitives, our
library also makes use of hash functions (both universal
one-way hashes [60] and Merkle hashes [59]), digital
signatures [47], pseudo-random functions [44], and sym- 
metric encryption [32]. The security of the protocols in 
our library relies on the security of each of these individ- 
ual components, as well as the security of any commit- 
ment schemes or zero-knowledge proofs used. 


3 Design 


The design of our library and language arose from our 
initial goal of providing a high-performance implemen- 
tation of protocols for e-cash and fair exchange for use 
in applications such as those described in the
introduction. For these applications, the need to support many
repeated interactions of the same protocol efficiently is a 
paramount concern for both the bank and the users. In the 
bank’s case, it must conduct withdraw and deposit pro- 
tocols with every user in the system, while in the user’s 
case it is possible that a user would want to conduct many 
transactions using the same system parameters. 
Motivated by these performance requirements, we ini- 
tially developed a more straightforward implementation 
of our protocols using C++ and GMP [43], but found 
that our ability to modify and optimize our implementa- 
tion was hampered by the complexity of our protocols. 
High-level changes to protocols required significant ef- 
fort to re-implement; meanwhile, potentially useful per- 
formance optimizations became difficult to implement, 
and there was no way to easily extend the functionality
of the library.

[Figure 2 diagram labels: ZKPDL Program; public values (security parameters, public keys, groups, generators, etc.); Interpreter; Verifier; Proof verified? (true/false)]

Figure 2: Usage of a ZKPDL program: the same program is 
compiled separately by the prover and verifier, who may also 
be provided with a set of fixed public parameters. This pro- 
duces an Interpreter object, which can be used by the prover to 
prove to a verifier that his private values satisfy a certain set of 
relationships. Serialization and processing of proof messages 
are provided by the library. Once compiled, an interpreter can 
be re-used on different private inputs, using the same public 
parameters that were originally provided. 


These difficulties led to our current design, illustrated 
in Figure 2. Our system allows a pseudocode-like de- 
scription of a protocol to be developed using our descrip- 
tion language, ZKPDL. This program is compiled by our 
interpreter, and optionally provided a list of public pa- 
rameters, which are “compiled in” to the program. At 
compile time, a number of transformations and optimiza- 
tions are performed on the abstract syntax tree produced 
by our parser, which we developed using the ANTLR 
parser generator [63]. Once compiled, these interpreter 
objects can be used repeatedly by the prover to generate 
zero-knowledge proofs about private values, or by the 
verifier to verify these proofs.

Key to our approach is the simplicity of our language. 
It is not Turing-complete and does not allow for branch- 
ing or conditionals; it simply describes the variables, 
equations, and relationships required by a protocol, leav- 
ing the implementation details up to the interpreter and 
language framework. This framework, described in the 
following section, provides C++ classes that parse, ana- 
lyze, optimize, and interpret ZKPDL programs, employ- 
ing many common compiler techniques (e.g., constant 
substitution and propagation, type-checking, providing 
error messages when undefined variables are used, etc.) 
in the process. We are able to understand and transform 
mathematical expressions into forms that provide better 
performance (e.g., through techniques for fixed-base ex- 
ponentiation), and recognize relationships between val- 
ues to be proved in zero-knowledge. All of these low- 
level optimizations, as well as our high-level primitives, 
should enable a programmer to quickly implement and 
evaluate the efficiency of a protocol. 
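
To illustrate why knowing the base ahead of time helps, here is a minimal sketch of one common fixed-base technique: precompute the powers base^(2^j) once, so that each later exponentiation needs only multiplications of table entries and no squarings. The class name FixedBase and its interface are hypothetical, and this is not necessarily the method used by the interpreter, which is described in Section 4.2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: caches base^(2^j) mod m for j = 0 .. bits-1.
class FixedBase {
public:
    FixedBase(std::uint64_t base, std::uint64_t mod, unsigned bits)
        : mod_(mod), table_(bits) {
        assert(mod > 1 && mod < (1ULL << 32));  // keeps products within 64 bits
        assert(bits > 0 && bits <= 64);
        std::uint64_t cur = base % mod;
        for (unsigned j = 0; j < bits; ++j) {
            table_[j] = cur;          // table_[j] == base^(2^j) mod m
            cur = cur * cur % mod;
        }
    }

    // base^exp mod m, multiplying together the cached powers selected by
    // the set bits of exp.
    std::uint64_t pow(std::uint64_t exp) const {
        // The exponent must fit within the precomputed table.
        assert(table_.size() == 64 || (exp >> table_.size()) == 0);
        std::uint64_t result = 1;
        for (std::size_t j = 0; j < table_.size(); ++j) {
            if ((exp >> j) & 1) result = result * table_[j] % mod_;
        }
        return result;
    }

private:
    std::uint64_t mod_;
    std::vector<std::uint64_t> table_;
};
```

The one-time cost of building the table is paid at "compile time"; it pays off when the same generator is raised to many different exponents, as happens when one protocol is run repeatedly with the same public parameters.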


We also provide a number of C++ classes that wrap 
ZKPDL programs into interfaces for generating and ver- 
ifying proofs, as well as marshaling them between com- 
puters. We build upon these wrappers to additionally 
provide Cashlib, a collection of interfaces that allows a 
programmer to assume the role of buyer, seller, bank, 
or arbiter in a fair exchange system based on endorsed 
e-cash [26], as seen in Figure 1 and described in Sec- 
tion 5.3. 


4 Implementation of ZKPDL 


To enable implementation of the cryptographic prim- 
itives discussed in Section 2, we have designed a pro- 
gramming language for specifying zero-knowledge pro- 
tocols, as well as an interpreter for this language. The 
interpreter is implemented in C++ and consists of ap- 
proximately 6000 lines of code. On the prover side, the 
interpreter will output a zero-knowledge proof for the re- 
lations specified in the program; on the verifier side, the 
interpreter will be given a proof and verify whether or 
not it is correct. Therefore, the output of the interpreter 
depends on the role of the user, although the program 
provided to the interpreter is the same for both. 


4.1 Overview 


Here we provide a brief overview of some fundamen- 
tal language features to give an idea of how programs are 
written; a full grammar for our language, containing all 
of its features, can be found in our documentation avail- 
able online, and further sample programs can be found 
in Section 5. A program can be broken down into two 
blocks: a computation block and a proof block. Each of 
these blocks is optional: if a user just wants a calculator 
for modular (or even just integer) arithmetic then he will
specify just the computation block; if, on the other hand,
he has all the input values pre-computed and just wants a
zero-knowledge proof of relations between these values,
he will specify just the proof block. Here is a sample
program written in our language (indentations are included
for readability, and are not required syntax).

sample.zkp
 1 computation: // compute values required for proof
 2   given: // declarations
 3     group: G = <g,h>
 4     exponents in G: x[2:3]
 5   compute: // declarations and assignments
 6     random exponents in G: r[1:3]
 7     x_1 := x_2 * x_3
 8     for(i, 1:3, c_i := g^x_i * h^r_i)
 9
10 proof:
11   given: // declarations of public values
12     group: G = <g,h>
13     elements in G: c[1:3]
14       for(i, 1:3, commitment to x_i: c_i = g^x_i * h^r_i)
15   prove knowledge of: // declarations of private values
16     exponents in G: x[1:3], r[1:3]
17   such that: // protocol specification; i.e. relations
18     x_1 = x_2 * x_3

In this example, we are proving that the value x_1
contained within the commitment c_1 is the product of the
two values x_2 and x_3 contained in the commitments c_2
and c_3. The program can be broken down in terms of how
variables are declared and used, and the computation and 
proof specifications. Note that some lines are repeated 
across the computation and proof blocks, as both are op- 
tional and hence considered independently. 


4.1.1 Variable declaration 


Two types of variables can be declared: group objects 
and numerical objects. Names of groups must start with 
a letter and cannot have any subscripts; sample group 
declarations can be seen in lines 3 and 12 of the above 
program. In these lines, we also declare the group gen- 
erators, although this declaration is optional (as we will 
see later on in Section 5, it is also optional to name the 
group modulus). 

Numerical objects can be declared in two ways. The 
first is in a list of variables, where their type is specified 
by the user. Valid types are element, exponent (which 
refer respectively to elements within a finite-order group 
and the corresponding exponents for that group), and 
integer; it should be noted that for the first two of these 
types a corresponding group must also be specified in the 
type information (see lines 4 and 13 for an example). The 
other way in which variables can be declared is in the 
compute block, where they are declared as they are be- 
ing assigned (meaning they appear on the left-hand side 
of an equation), which we can see in lines 7 and 8. In this 
case, the type is inferred by the values on the right-hand 
side of the equation; a compile-time exception will be 
thrown if the types do not match up (for example, if el- 
ements from two different groups are being multiplied). 
Numerical variables must start with a letter and are
allowed to have subscripts.


4.1.2 Computation 


The computation block breaks down into two blocks 
of its own: the given block and the compute block. The 
given block specifies the parameters, as well as any val- 
ues that have already been computed by the user and are 
necessary for the computation (in the example, the group 
G can be considered a system parameter and the values 
x_2 and x_3 are just needed for the computation). 

The compute block carries out the specified compu- 
tations. There are two types of computations: picking a 
random value, and defining a value by setting it equal to 
the right-hand side of an equation. We can see an ex- 
ample of the former in line 6 of our sample program; 
in this case, we are picking three random exponents in 
a group (note r[1:3] is just syntactic sugar for writing 
r_1, r_2, r_3). We also support picking a random
integer from a specified range, and picking a random prime
of a specified length (examples of these can be found 
in Section 5). As already noted, lines 7 and 8 provide 
examples of lines for computing equations. In line 8, 
the for syntax is again just syntactic sugar, this time
to succinctly specify the relations c_1 = g^x_1*h^r_1,
c_2 = g^x_2*h^r_2, and c_3 = g^x_3*h^r_3. We
have a similar for syntax for specifying products or
sums (much like ∏ or ∑ in conventional mathematical
notation), but neither of these for macros should be
confused with a for loop in a conventional programming 
language. 
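
The macro nature of for can be sketched as plain textual expansion: one copy of the body is produced per index, with each subscript replaced by the index value. The helper expand_for below is hypothetical and is not part of the ZKPDL interpreter, which operates on an abstract syntax tree; it only illustrates that no looping happens at proof time.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: expands for(var, lo:hi, body) into one line per index,
// replacing each subscript "_var" in body with "_<index>".
std::vector<std::string> expand_for(const std::string& var, int lo, int hi,
                                    const std::string& body) {
    assert(lo <= hi && !var.empty());
    const std::string needle = "_" + var;
    std::vector<std::string> out;
    for (int i = lo; i <= hi; ++i) {
        std::string line;
        std::size_t pos = 0;
        while (pos < body.size()) {
            std::size_t hit = body.find(needle, pos);
            if (hit == std::string::npos) {
                line += body.substr(pos);   // no more subscripts: copy the rest
                break;
            }
            std::size_t end = hit + needle.size();
            // Substitute only when the subscript ends here, so that "_i"
            // matches but the prefix of "_in" does not.
            bool whole = end == body.size() ||
                         !(std::isalnum(static_cast<unsigned char>(body[end])) ||
                           body[end] == '_');
            line += body.substr(pos, hit - pos);
            line += whole ? "_" + std::to_string(i) : needle;
            pos = end;
        }
        out.push_back(line);
    }
    return out;
}
```

Calling expand_for("i", 1, 3, "c_i := g^x_i * h^r_i") yields the three assignments for c_1, c_2, and c_3 that line 8 of the sample program stands for.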


4.1.3 Proof specification


The proof block is comprised of three blocks: the 
given block, the prove knowledge of block, and the 
such that block. In the given block, the parame- 
ters for the proof are specified, as well as the public 
inputs known to both the prover and verifier for the 
zero-knowledge protocol. In the prove knowledge of 
block, the prover’s private inputs are specified. Finally, 
the such that block specifies the desired relations be- 
tween all the values; the zero-knowledge proof will be 
a proof that these relations are satisfied. We currently 
support four main types of relations: 


• Proving knowledge of the opening of a commitment [66].
We can prove openings of Pedersen [64] or
Fujisaki-Okamoto commitments [41, 36]. In both cases we
allow for commitments to multiple values.

• Proving equality of the openings of different
commitments. Given any number of commitments, we can
prove the equality of any subset of the values contained
within the commitments.

• Proving that a committed value is the product of two
other committed values [36, 17]. As seen in our sample
program, we can prove that a value x contained within a
commitment is the product of two other values y, z
contained within two other commitments; i.e., x = y · z.
As a special case, we can also prove that x = y^2.

• Proving that a committed value is contained within a
public range [17, 54]. We can prove that the value x
contained within a given commitment satisfies
lo < x < hi, where lo and hi are both public values.


There are a number of other zero-knowledge proof 
types (e.g., proving a value is a Blum integer, proving 
that committed values satisfy some polynomial relation- 
ship, etc.), but we chose these four based on their wide 
usage in applications, in particular in e-cash and anony- 
mous credentials. We note, however, that adding other 
proof types to the language should require little work (as 
mentioned in Section 4.2), as we specifically designed 
the language and interpreter with modularity in mind. 


4.1.4 Sample usage 


In addition to showing a sample program, we would 
also like to demonstrate a sample usage of our interpreter 
API. In order to use the sample ZKPDL program from 
Section 4.1, one could use the following C++ code (as- 
suming there are already numerical variables named x2 
and x3, and a group named G): 


group_map g;
variable_map v;
g["G"] = G;
v["x2"] = x2;
v["x3"] = x3;
InterpreterProver prover;
// compiles program with groups
prover.check("sample.zkp", g);
// computes intermediate values needed for proof
prover.compute(v);
// computes and outputs proof
ProofMessage proofMsg(prover.getPublicVariables(),
                      prover.computeProof());


The method is the same for all programs: any nec- 
essary groups and/or variables are inserted into the ap- 
propriate maps, which are then passed to the interpreter. 
Note that the group map in this case is passed to the in- 
terpreter at “compile time” so that it may pre-compute 
powers of group generators to be used for exponentia- 
tion optimizations (described in the next section); how- 
ever, both the group and variable maps may be provided 
at “compute time.” Any syntactic errors will be caught at 
compile time, but if the inputs provided at compute time 
are not valid for the relations being proved, the proof will 
be computed anyway and the error will be caught by the 
verifier. The ProofMessage is a Serializable container
for the zero-knowledge proof and any intermediate val-
ues (e.g., commitments and group bases) that the verifier
might need to verify the proof.


The method is almost identical for the verifier: 


group_map g;
variable_map v;
g["G"] = G;
InterpreterVerifier verifier;
verifier.check("sample.zkp", g);
verifier.compute(v, proofMsg.publics);
bool verified = verifier.verify(proofMsg.proof);


As we can see, the main difference is that the verifier
uses both its own public inputs and the prover’s public
values at compute time (with its own inputs always tak-
ing precedence over the ProofMessage inputs), but still
takes in the proof to be checked afterwards so that the
actions of the prover and verifier remain symmetric.


4.2 Optimizations 


In our interpreter, we have incorporated a number of 
optimizations that make using our language not only 
more convenient but also more efficient. Here we de- 
scribe the most significant optimizations, which include 
removing any redundancy when multiple proofs are com- 
bined and performing multi-exponentiations on cached 
bases when the same bases are used frequently. Other 
improvements specific to existing protocols can be found 
in Section 5. 


4.2.1 Translation 


To eliminate redundancy between different proofs, we
first translate each proof described in Section 4.1.3 into
a “fundamental discrete logarithm form.” In this form,
each proof can be represented by a collection of equa-
tions of the form $A = B^x \cdot C^y$. For example, if the
prover would like to prove that the value x contained
within $C_x = g^x h^{r_x}$ is equal to the product of the values
y and z contained within $C_y = g^y h^{r_y}$ and $C_z = g^z h^{r_z}$
respectively, this is equivalent to a proof of knowledge
of the discrete logarithm equalities $C_y = g^y h^{r_y}$ and
$C_x = C_z^y h^{r_x - y r_z}$.
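
The step from the product relation to discrete logarithm form can be checked by direct substitution. The short derivation below assumes the commitments are written as $C_x = g^x h^{r_x}$, $C_y = g^y h^{r_y}$, and $C_z = g^z h^{r_z}$ with $x = yz$; the subscripted randomness names are notation adopted here for the sketch.

```latex
\begin{align*}
C_z^{\,y} &= \left(g^{z} h^{r_z}\right)^{y} = g^{yz}\, h^{y r_z} \\
C_x &= g^{x} h^{r_x} = g^{yz}\, h^{r_x} \\
    &= C_z^{\,y} \cdot h^{\,r_x - y r_z}
\end{align*}
```

Both resulting equalities, $C_y = g^y h^{r_y}$ and $C_x = C_z^y h^{r_x - y r_z}$, have the shape $A = B^x \cdot C^y$, and the shared exponent y is what ties the two commitments together.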

Our sample program in the previous section is first 
translated into this discrete logarithm form. During run- 
time, the values provided to the prover are then used to 
generate the zero-knowledge proof. In addition to elim- 
inating redundancy between proofs of different relations 
in the program, this technique also allows our language 
to easily add new types of proofs as they become avail- 
able. To add any proof that can be broken down into this
discrete logarithm form, we need to add only a transla-
tion function and a rule in the grammar for how we would
like to specify this proof in a program, and the rest of the
work will be handled by our existing framework.


4.2.2 Multi-exponentiation


The computational performance of many crypto-
graphic protocols, especially those used by our library,
is often dominated by the need to perform many modular
exponentiation operations on large integers. These op-
erations typically involve the use of system parameters
as bases, with exponents chosen at random or provided
as private inputs (e.g., Pedersen commitments, which re-
quire computation of $g^x \cdot h^r$, where g and h are publicly
known). Algorithms for simultaneous multiple expo-
nentiation allow the result of multi-base exponentiations
such as these to be computed without performing each
intermediate exponentiation individually; an overview of
these algorithms can be found in Section 14.6 of Menezes
et al. [58].

Our interpreter leverages the descriptions of mathe- 
matical expressions in ZKPDL programs to recognize 
when fixed-base exponentiation operations occur, allow- 
ing it to precompute lookup tables at compile time that 
can speed up these computations dramatically. In addi- 
tion to single-table multi-exponentiation techniques (i.e., 
the $2^w$-ary method [58]), we offer programmers who
expect to run the same protocol many times the abil- 
ity to take advantage of time/space tradeoffs by gener- 
ating large lookup tables of precomputed powers. This 
allows a programmer to choose parameters that balance 
the memory requirements of the interpreter against the 
need for fast exponentiation. 

For single-base exponentiation, we employ window-
based precomputation techniques similar to those used
by PBC [56] to cache powers of fixed bases. For multi-
base exponentiation of k exponents, we currently extend
the $2^w$-ary method to store $2^{wk}$-sized lookup tables for
each w-bit window of the expected exponent length, so
that multi-exponentiations on exponents of length n re-
quire only n/w multiplications of stored values. While
we are also evaluating other algorithms offering similar
time-space tradeoffs, we demonstrate the performance
gains afforded by these techniques later in Table 1.
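
The table layout described above can be sketched for the two-base case (k = 2). This is an illustration under stated assumptions, not the library's implementation: the type name TwoBaseTable, the 64-bit toy arithmetic, and the requirement that the modulus fit in 32 bits are all choices made here so the example stays self-contained.

```cpp
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Square-and-multiply modular exponentiation (mod must be below 2^32).
u64 pow_mod(u64 base, u64 exp, u64 mod) {
    u64 result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

// Precomputed tables for g^a * h^b mod m with fixed bases g and h.
// One table of 2^(2w) entries is stored per w-bit window of the exponents.
struct TwoBaseTable {
    u64 mod;
    unsigned w;        // window width in bits
    unsigned windows;  // number of windows, ceil(n / w)
    // table[j][(da << w) | db] = g^(da * 2^(w*j)) * h^(db * 2^(w*j)) mod m
    std::vector<std::vector<u64>> table;

    TwoBaseTable(u64 g, u64 h, u64 m, unsigned n_bits, unsigned w_bits)
        : mod(m), w(w_bits), windows((n_bits + w_bits - 1) / w_bits),
          table(windows, std::vector<u64>(u64{1} << (2 * w_bits))) {
        for (unsigned j = 0; j < windows; ++j) {
            // Bases shifted to window j: g^(2^(w*j)) and h^(2^(w*j)).
            const u64 gj = pow_mod(g, u64{1} << (w * j), mod);
            const u64 hj = pow_mod(h, u64{1} << (w * j), mod);
            for (u64 da = 0; da < (u64{1} << w); ++da)
                for (u64 db = 0; db < (u64{1} << w); ++db)
                    table[j][(da << w) | db] =
                        pow_mod(gj, da, mod) * pow_mod(hj, db, mod) % mod;
        }
    }

    // Computes g^a * h^b mod m with one table lookup and one
    // multiplication per window; a and b must be below 2^(n_bits).
    u64 multi_exp(u64 a, u64 b) const {
        const u64 mask = (u64{1} << w) - 1;
        u64 result = 1 % mod;
        for (unsigned j = 0; j < windows; ++j) {
            const u64 da = (a >> (w * j)) & mask;
            const u64 db = (b >> (w * j)) & mask;
            result = result * table[j][(da << w) | db] % mod;
        }
        return result;
    }
};
```

With n = 16 and w = 4, each call performs four multiplications of stored values, at the price of four tables of 256 entries each; widening the window trades more memory for fewer multiplications.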


4.2.3 Interpreter caching 


We also cache the parsed, compiled environments of 
ZKPDL programs when they are first run. Because we 
accept system parameters at compile time, we are able to 
evaluate and propagate any subexpressions made up of 
fixed constants and perform exponentiation precomputa- 
tions before these expressions are fully evaluated at run- 
time. Even without the use of large tables for fixed-base
exponentiation, this optimization proves useful when re-
peated executions of the same program must be per-
formed; e.g., for a bank dealing with e-coin deposits. In
this case, a bank must invoke the interpreter for each coin
deposited; looking ahead to Table 1 we see that, on aver-
age, this operation takes the bank 83ms. If our program
were re-parsed each time, it would take an extra 10ms,
as opposed to the fraction of a millisecond required to
load a cached interpreter environment, saving the bank
approximately 10% of computation time per transaction
by avoiding parsing overhead.
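
The caching idea can be illustrated with a small sketch. The names ProgramCache and CompiledProgram are hypothetical stand-ins introduced here, not the interpreter's actual classes; the counter only marks where the expensive parse and compile step would run.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a parsed and compiled ZKPDL environment.
struct CompiledProgram {
    std::string source;
};

class ProgramCache {
public:
    // Returns the cached environment for `name`, compiling only on first use.
    const CompiledProgram& get(const std::string& name,
                               const std::string& source) {
        auto it = cache_.find(name);
        if (it == cache_.end()) {
            ++parse_count_;  // the costly parse/compile step would happen here
            it = cache_.emplace(name, CompiledProgram{source}).first;
        }
        return it->second;
    }

    // Number of times a program actually had to be parsed.
    std::size_t parse_count() const { return parse_count_; }

private:
    std::unordered_map<std::string, CompiledProgram> cache_;
    std::size_t parse_count_ = 0;
};
```

For a bank handling many deposits with the same program, every request after the first is served from the map, so the parsing cost is paid once rather than per transaction.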


5 Sample Programs and Performance 


Using our language, we have written programs for a 
wide variety of cryptographic primitives, including blind 
signatures [24], verifiable encryption [27], and endorsed 
e-cash [26]. In the following sections, we provide our 
programs for these three primitives; in addition, perfor- 
mance benchmarks for all of them can be found at the 
end of the section. 


5.1 CL signatures 


Using our language, we have implemented the 
blind signature scheme due to Camenisch and Lysyan- 
skaya [24]; as we will see in Section 5.3, CL signatures 
are integral to endorsed e-cash. Briefly, a blind signa-
ture, as introduced by Chaum [28], enables a signature
issuer to sign a message without learning the contents of 
the message. A CL signature works in two main phases: 
an issuing phase, in which a user actually obtains the sig- 
nature, and a proving phase, in which the user is able to 
prove (in zero-knowledge) to other users that he does in 
fact possess a valid CL signature. 

The issuing phase is a one-round interaction between 
the recipient and the issuer, at the end of which the recip- 
ient obtains the blind signature on her message(s). Be- 
cause the protocol is interactive, we present one program 
for each stage of the protocol. At the end of this first 
stage, the signature issuer will return a partial signature 
to the recipient, who will then use this signature to com- 
pute the full signature on the hidden message. 


cl-recipient-proof.zkp

computation:
  given:
    group: pkGroup = <fprime, gprime[1:L+k], hprime>
    exponents in pkGroup: x[1:L]
    integers: stat, modSize
  compute:
    random integer in [0,2^(modSize+stat)): vprime
    C := hprime^vprime * for(i, 1:L, *, gprime_i^x_i)

proof:
  given:
    group: pkGroup = <fprime, gprime[1:L+k], hprime>
    group: comGroup = <f, g, h, h1, h2>
    element in pkGroup: C
    elements in comGroup: c[1:L]
      for(i, 1:L, commitment to x_i: c_i=g^x_i*h^r_i)
    integer: l_x
  prove knowledge of:
    integers: x[1:L]
    exponents in comGroup: r[1:L]
    exponent in pkGroup: vprime
  such that:
    for(i, 1:L, range: (-(2^l_x-1)) <= x_i < 2^l_x)
    C = hprime^vprime * for(i, 1:L, *, gprime_i^x_i)
    for(i, 1:L, c_i = g^x_i * h^r_i)


Next, the issuer must prove the partial signature is
computed correctly, as in the following program. 


cl-issuer-proof.zkp

computation:
  given:
    group: pkGroup = <f, g[1:L+k], h>
    element in pkGroup: C
    exponents in pkGroup: x[1:k+L]
    integers: stat, modSize, lx
  compute:
    random integer in [0,2^(modSize+lx+stat)): vpp
    random prime of length lx+2: e
    einverse := 1/e
    A := (f*C*h^vpp * for(i,L+1:k+L,*,g_i^x_i))^einverse

proof:
  given:
    group: pkGroup = <f, g[1:L+k], h>
    elements in pkGroup: A, C
    exponents in pkGroup: e, vpp, x[L+1:k]
  prove knowledge of:
    exponents in pkGroup: einverse
  such that:
    A = (f*C*h^vpp * for(i,L+1:k+L,*,g_i^x_i))^einverse











Once the recipient obtains the partial signature, she 
can unblind it to obtain a full signature; this step com- 
pletes the issuing phase. 


Now, the owner of a CL signature needs a way to prove 
that she has a signature, without revealing either the sig- 
nature or the values. To accomplish this, the prover first 
randomizes the CL signature and then attaches a zero- 
knowledge proof of knowledge that the randomized sig- 
nature corresponds to the original signature on the com- 
mitted message. 


cl-possession-proof.zkp

computation:
  given:
    group: pkGroup = <fprime, gprime[1:L+k], hprime>
    element in pkGroup: A
    exponents in pkGroup: e, v, x[1:L]
    integers: modSize, stat
  compute:
    random integers in [0,2^(modSize+stat)): r, r_C
    vprime := v + r*e
    Aprime := A * hprime^r
    C := h^r_C * for(i, 1:L, *, gprime_i^x_i)
    D := for(i, L+1:L+k, *, gprime_i^x_i)
    fCD := f * C * D

proof:
  given:
    group: pkGroup = <fprime, gprime[1:L+k], hprime>
    group: comGroup = <f, g, h, h1, h2>
    elements in pkGroup: C, D, Aprime, fCD
    elements in comGroup: c[1:L]
      for(i, 1:L, commitment to x_i: c_i=g^x_i*h^r_i)
    exponents in pkGroup: x[L+1:L+k]
    integer: l_x
  prove knowledge of:
    integers: x[1:L]
    exponents in comGroup: r[1:L]
    exponents in pkGroup: e, vprime, r_C
  such that:
    for(i, 1:L, range: (-(2^l_x - 1)) <= x_i < 2^l_x)
    C = hprime^r_C * for(i, 1:L, *, gprime_i^x_i)
    for(i, 1:L, c_i = g^x_i * h^r_i)
    fCD = (Aprime^e) * hprime^(r_C - vprime)


5.2 Verifiable encryption


Briefly, verifiable encryption consists of a ciphertext 
under the public key of some trusted third party (in our 
case, the arbiter) and a zero-knowledge proof that the val- 
ues inside the ciphertext satisfy some relation; this pair is 
often referred to as a verifiable escrow. Our implementa- 
tion of verifiable encryption is based on the construction 
of Camenisch and Shoup [27]. The main use of verifi- 
able encryption in e-cash is to allow a user to verifiably 
encrypt the opening of a commitment under the public 
key of the arbiter. A recipient of such a verifiable escrow 
can then verify that the encrypted values correspond to 
the opening of the commitment. 

verifiable-encryption.zkp

computation:
  given:
    group: secondGroup = <g[1:m], h>
    group: RSAGroup
      modulus: N
    group: G
    group: cashGroup = <f_3, gprime, hprime, f_1, f_2>
    exponents in G: x[1:m]
    elements in G: u[1:m], v, w
  compute:
    random integer in [0,N/4): s
    random exponents in secondGroup: r[1:m]
    for(i, 1:m, c_i := g_1^x_i * g_2^r_i)
    Xprime := for(i, 1:m, *, g_i^x_i) * h^s
    vsquared := v^2
    wsquared := w^2
    for(i, 1:m, usquared_i := u_i^2)

proof:
  given:
    group: secondGroup = <g[1:m], h>
    group: G
    group: RSAGroup
      modulus: N
    group: cashGroup = <f_3, gprime, hprime, f_1, f_2>
    element in cashGroup: X
    elements in secondGroup: Xprime, c[1:m]
      for(i,1:m,commitment to x_i: c_i=g_1^x_i*g_2^r_i)
    elements in G: a[1:m], b, d, e, f, usquared[1:m],
                   vsquared, wsquared
  prove knowledge of:
    integers: x[1:m], r
    exponent in G: hash
    exponents in secondGroup: r[1:m], s
  such that:
    for(i, 1:m, range: -N/2 + 1 <= x_i < N/2)
    vsquared = f^(2*r)
    wsquared = (d * e^hash)^(2*r)
    for(i, 1:m, usquared_i = b^(2*x_i) * a_i^(2*r))
    X = for(i, 1:m, *, f_i^x_i)
    Xprime = for(i, 1:m, *, g_i^x_i) * h^s


5.3 E-cash


Electronic cash, or e-cash for short, was first intro- 
duced by Chaum [28] and can be thought of as the 
electronic equivalent of cash; i.e., an electronic cur- 
rency that preserves users’ anonymity, as opposed to 
electronic checks [30] or credit cards. We implement 
endorsed e-cash, due to Camenisch, Lysyanskaya, and 
Meyerovich [26] (which is an extension of compact e- 
cash [21]), for two main reasons. Our first reason is 
that an endorsed e-coin can be split up into two parts, its 
endorsement and an unendorsed component; only with
both of these parts can the coin be considered complete.
As we will see in Section 6.2.1, this property enables 
efficient fair exchange. The second reason for choos- 
ing endorsed e-cash is that it is offline, which means 
the bank does not need to be active in every transac- 
tion; this significantly reduces the burden placed on the 
bank. Although the bank does not check the coin in 
every interaction, endorsed e-cash has the property that 
double-spenders (i.e., users who try to spend the same 
coin twice) can be caught by the bank at the time of de- 
posit and punished accordingly. Because e-cash is meant 
to preserve privacy, however, a user is also guaranteed 
that unless she double spends a coin, her identity will be 
kept secret. 

During the withdrawal phase of endorsed e-cash, a 
user contacts the bank. Before withdrawing, the user will 
have registered with the bank by storing a commitment. 
In order to prove her identity, then, the user will provide 
a proof that she knows the opening of the registered com- 
mitment. This can be accomplished using the following 
simple program: 


user-id-proof.zkp

proof:
  given:
    group: cashGroup = <f,g,h,h1,h2>
    elements in cashGroup: A, pk_u
      commitment to sk_u: A = g^sk_u * h^r_u
  prove knowledge of:
    exponents in cashGroup: sk_u, r_u
  such that:
    pk_u = g^sk_u
    A = g^sk_u * h^r_u


Once the bank has verified this proof, the user and the
bank will run a protocol to obtain a CL signature (us- 
ing the programs we saw in Section 5.1) on the user’s 
identity and two pseudo-random function seeds. These 
private values and the signature on them define a wallet 
that contains W coins (where W is a system-wide public 
parameter). 

When a user wishes to spend one of her coins, she 
splits it up into its unendorsed part and the endorsement. 
She then sends the unendorsed component to a merchant 
and proves it is valid. If the merchant then sends her what 
she wanted to buy, she will follow up with the endorse- 
ment to complete the coin and the transaction is com- 
plete. The following program is used for proving the va- 
lidity of a coin. 


coin-proof.zkp

computation:
  given:
    group: cashGroup = <f, g, h, h1, h2>
    exponents in cashGroup: s, t, sk_u
    integer: J
  compute:
    random exponents in cashGroup: r_B, r_C, r_D, x1,
                                   x2, r_y, R
    alpha := 1 / (s + J)
    beta := 1 / (t + J)
    C := g^s * h^r_C
    D := g^t * h^r_D
    y := h1^x1 * h2^x2 * f^r_y
    B := g^sk_u * h^r_B
    S := g^alpha * g^x1
    T := g^sk_u * (g^R)^beta * g^x2

proof:
  given:
    group: cashGroup = <f, g, h, h1, h2>
    elements in cashGroup: y, S, T, B, C, D
      commitment to sk_u: B = g^sk_u * h^r_B
      commitment to s: C = g^s * h^r_C
      commitment to t: D = g^t * h^r_D
    integer: J
  prove knowledge of:
    exponents in cashGroup: x1, x2, r_y, sk_u, alpha,
                            beta, s, t, r_B, r_C, r_D, R
  such that:
    y = h1^x1 * h2^x2 * f^r_y
    S = g^alpha * g^x1
    T = g^sk_u * (g^R)^beta * g^x2
    g = (g^J * C)^alpha * h^(-r_C / (s+J))
    g = (g^J * D)^beta * h^(-r_D / (t+J))





5.4 Performance 


Here we measure the communication and computa- 
tional resources used by our system when running each 
of the programs above. The benchmarks presented in Ta- 
ble 1 were collected on a MacBook Pro with a 2.53GHz 
Intel Core 2 Duo processor and 4GB of RAM running 
OS X 10.6; we therefore expect that these results will 
reflect those of a typical home user with no special cryp- 
tographic hardware support. 

As for speed, caching exponents of fixed bases re- 
sults in a significant performance increase, making it 
an important optimization for applications that require 
repeated protocol executions. The only caveat is that 
the exponent cache required for complex protocols can 
grow to hundreds of megabytes (using faster-performing 
parameters), and so our library allows users to choose 
whether to use caching, and if so how much of the cache 
should be used by this optimization. 

The time taken for the higher-level protocols provides 
a clear view of the complexity of each protocol. For ex- 
ample, the marked difference between the time required 
to generate a CL issuer proof and a CL possession proof 
can be attributed to the fact that a CL issuer proof re- 
quires proving only one discrete log relation, while a CL 
possession proof on three private values requires three 
range proofs and five more discrete log relations. 

Table 1 also shows that verifiable encryption is by 
far the biggest bottleneck, requiring almost three times 
as much time to compute as any other step. As seen 
in the program in Section 5.2, there is one range proof 
performed for each value contained in the verifiable es- 
crow. In order to perform a range proof, the value con- 
tained in the range must be decomposed as a sum of four 
squares [65]. Because the values used in our verifiable 
encryption program are much larger than the ones used 
in CL signatures (about 1024 vs. 160 bits, to get 80-bit 
security for both), this decomposition often takes con-
siderably more time for verifiable encryption than it does
for CL signatures. Furthermore, since the values being
verifiably encrypted are different each time, caching the
decomposition of the values would not be of any use.
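
The decomposition step can be illustrated with a brute-force sketch. Lagrange's four-square theorem guarantees that every nonnegative integer is a sum of four squares, which is why exhibiting such a decomposition shows that a quantity is nonnegative. The exhaustive search below is an assumption made for illustration and is only workable for tiny inputs; for values of around 1024 bits, as discussed above, a real implementation would need a far more efficient algorithm.

```cpp
#include <array>
#include <cstdint>

// Largest r with r * r <= v (simple loop, adequate for small v).
std::uint64_t isqrt(std::uint64_t v) {
    std::uint64_t r = 0;
    while ((r + 1) * (r + 1) <= v) ++r;
    return r;
}

// Finds a <= b <= c <= d with a^2 + b^2 + c^2 + d^2 == n by exhaustive
// search and stores them in `out`. Returns false only if no decomposition
// is found, which Lagrange's theorem rules out for every n >= 0.
bool four_squares(std::uint64_t n, std::array<std::uint64_t, 4>& out) {
    for (std::uint64_t a = 0; a * a <= n; ++a) {
        for (std::uint64_t b = a; a * a + b * b <= n; ++b) {
            for (std::uint64_t c = b; a * a + b * b + c * c <= n; ++c) {
                const std::uint64_t rest = n - (a * a + b * b + c * c);
                const std::uint64_t d = isqrt(rest);
                if (d * d == rest && d >= c) {
                    out = {a, b, c, d};
                    return true;
                }
            }
        }
    }
    return false;
}
```

Because the search must be repeated for each fresh value, the cost cannot be amortized by caching, which matches the observation above about verifiable encryption.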

A final observation on computational performance is 
that proving possession of a CL signature completely 
dominates the time required to prove the validity of a 
coin, since the timings for the two proofs are within mil-
liseconds of each other. This suggests that the only way to
significantly improve the performance of e-coins and verifiable
encryption would be to develop more efficient techniques 
for range proofs (which has in fact been the subject of 
some recent cryptographic research [48, 18, 67]). 

In terms of proof size, range proofs are much larger 
than proofs for discrete logarithms or multiplication. 
This is to be expected, as translating a range proof into 
discrete logarithm form (as described in Section 4.2) re- 
quires eleven equations, whereas a single DLR proof re- 
quires only one, and a multiplication proof requires two. 


6 Implementation of Cashlib 


Using the primitives described in the previous section, 
we wrote a cryptographic library designed for optimistic 
fair exchange protocols. Fair exchange [31] involves a 
situation in which a buyer wants to make sure that she 
doesn’t pay a merchant unless she gets what she is buy- 
ing, while the merchant doesn’t want to give away his 
goods unless he is guaranteed to be paid. It is known 
that fair exchange cannot be done without a trusted third 
party [62], but optimistic fair exchange [2, 3] describes 
the cases in which the trusted third party has to get in- 
volved only in the case of a dispute. 

The library was written in C++ and consists of approx- 
imately 11000 lines of code in addition to the interpreter. 
A previous version of the library in which all the pro- 
tocols and proofs were hand-coded (i.e., the interpreter 
was not used) consisted of approximately 20000 lines 
of code, meaning that the use of roughly 400 lines of 
ZKPDL was able to replace 9000 lines of our original 
C++ code (and, as we will see, make our operations more 
efficient as well). 


6.1 Endorsed e-cash 


A description of endorsed e-cash can be found in Sec- 
tion 5.3; the version used in our library, however, con- 
tains a number of optimizations. Just as with real cash, 
we now allow for different coin denominations. Each 
coin denomination corresponds to a different bank pub- 
lic key, so once the user requests a certain denomination, 
the wallet is then signed using the corresponding public 
key. A coin generated from such a wallet will verify only
when the same public key of the bank is used, and thus
the merchant can check for himself the denomination of
the coin.


                       Prover (ms)          Verifier (ms)        Proof size  Cache size  Multi-exps
                       With cache  Without  With cache  Without  (bytes)     (Mbytes)    Prover  Verifier
DLR proof              3.07        3.08     1.26        1.25     511         0
Multiplication proof   2.03        4.07     1.66        2.52     848         33.5
Range proof            36.36       74.52    21.63       31.54    5455        33.5        31      11
CL recipient proof     119.92      248.31   70.76       112.13   19189       134.2       104     39
CL issuer proof        7.29        7.38     1.73        1.73     1097        0           2       1
CL possession proof    125.89      253.17   78.19       117.67   19979       134.2       109     40
Verifiable encryption  416.09      617.61   121.87      162.77   24501       190.2       113     42
Coin                   134.37      271.34   83.01       121.83   22526       223.7       122     45

Table 1: Time (in milliseconds) and size (in bytes) required for each of our proofs, averaged over twenty runs. Timings are
considered from both the prover and verifier sides, as are the number of multi-exponentiations, and are considered both with and
without caching for fixed-base exponentiations; the size of the cache is also measured (in megabytes). As we can see, using
caching results on average in a 48% speed improvement for the prover, and a 31% improvement for the verifier.

The program in Section 5.3 also reflects our decision 
to randomize the user’s spending order rather than hav- 
ing them perform a range proof that the coin index was 
contained within the proper range. As the random spend- 
ing order does not reveal how many coins are left in the 
wallet, the user’s privacy is still protected even though 
the index is publicly available. Furthermore, because 
range proofs are slow and require a fair amount of space 
(see Table 1 for a reminder), this optimization resulted in
coins that were 20% smaller and 21% faster to generate 
and verify. 
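
A randomized spending order can be sketched as a shuffle of the coin indices in a wallet of W coins. The function name and the use of std::mt19937_64 are assumptions made to keep the example short and reproducible; a deployment would presumably draw the permutation from a cryptographically secure source of randomness instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns the coin indices 0..W-1 in a random order, so that the index
// of a spent coin says nothing about how many coins remain in the wallet.
std::vector<std::size_t> spending_order(std::size_t wallet_size,
                                        std::mt19937_64& rng) {
    std::vector<std::size_t> order(wallet_size);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

The user then spends coin order[0] first, order[1] second, and so on; every index is used exactly once, so no range proof on the index is needed to keep it inside the wallet.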

Finally, endorsed e-cash requires a random value con- 
tributed by both the merchant and the user. Since e-coin 
transactions should be done over a secure channel, in 
practice we expect that SSL connections will be used be- 
tween the user and the merchant. One useful feature of 
an SSL connection is that it already provides both parties 
with shared randomness, and thus this randomness can 
be used in our library to eliminate the need for a redun- 
dant message. 


6.2 Buying and Bartering 


Our library implements two efficient optimistic fair 
exchange protocols for use with e-cash. Belenkiy et 
al. [9] provide a buy protocol for exchanging a coin
for a file, while Küpçü and Lysyanskaya [51] provide
a barter protocol for exchanging two files or blocks. The 
two protocols serve different purposes (buy vs. barter) 
and so we have implemented both. 

Two of the main usage scenarios of fair exchange pro- 
tocols are e-commerce and peer-to-peer file sharing [9]. 
In e-commerce, one needs to employ a buy protocol to 
ensure that both the user and the merchant are protected; 
the user receives her item while the merchant receives 
his payment. In a peer-to-peer file sharing scenario, peers 
exchange files or blocks of files. In this setting, it is more 
beneficial to barter for the blocks than to buy them one at 
a time; for an exchange of n blocks, buying all the blocks
requires O(n) verifiable escrow operations (which, as
discussed in Section 5.4, are quite costly), whereas bar- 
tering for the blocks requires only one such operation, 
regardless of the number of blocks exchanged. 
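
The difference in cost can be written down as a small model. The function names and the per-escrow cost parameter are assumptions introduced for this sketch; it counts only verifiable escrow operations and ignores all other protocol work.

```cpp
#include <cstddef>

// Verifiable escrow operations needed to buy n blocks: one per block.
constexpr std::size_t escrows_when_buying(std::size_t n_blocks) {
    return n_blocks;
}

// Verifiable escrow operations needed to barter for n blocks: a single
// escrow, regardless of how many blocks are exchanged.
constexpr std::size_t escrows_when_bartering(std::size_t n_blocks) {
    return n_blocks == 0 ? 0 : 1;
}

// Rough escrow cost in milliseconds for a given per-escrow cost.
constexpr double escrow_cost_ms(std::size_t escrows, double ms_per_escrow) {
    return static_cast<double>(escrows) * ms_per_escrow;
}
```

For example, with an illustrative cost of 600 ms per escrow, escrow_cost_ms(escrows_when_buying(10), 600.0) gives 6000 ms, while the bartering case stays at 600 ms.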


Although the solution might seem to be to barter all 
the time and never buy, Belenkiy et al. suggest that both 
protocols are useful in a peer-to-peer file sharing sce- 
nario. Peers who have nothing to offer but would still like 
to download can offer to buy the files, while peers who 
would like only to upload and have no interest in down- 
loading can act as the merchant and earn e-cash. Due to 
the resource considerations mentioned above, however, 
bartering should always be used if possible. 


Because peers do not always know beforehand if they 
want to buy or barter for a file, we have modified the buy 
protocol to match up with the barter protocol in the first 
two messages. This modification, as well as outlines of 
both the protocols, can be seen in Figure 3. We further 
modified both protocols to let them exchange multiple 
blocks at once, so that one block of the fair exchange 
protocol might correspond to multiple blocks of the un- 
derlying file. 

We give an overview of each protocol below, with the 
optimizations we have added. We have also implemented 
the trusted third parties (the bank and the arbiter) neces- 
sary for e-cash and fair exchange. Although we do not 
describe in detail the resolution and bank interaction pro- 
tocols, these can be found in the original papers [9, 51] 
and we provide performance benchmarks for the bank in 
Table 2. 


6.2.1 Buying 


The modified buy protocol is depicted on the left in
Figure 3, although we also allow for the users to partici-
pate in the original buy protocol (in which the messages
appear in a slightly different order). To initiate the mod-
ified buy protocol, the buyer sends a “setup” message,
which consists of an unendorsed coin and a verifiable es-
crow on its corresponding endorsement. Upon receiving
this message, the seller will use the programs in Section 5
to check the validity of the coin and the escrow. If these
proofs verify, the seller will proceed by sending back an
encrypted version of his file (or file block). Upon receiv-
ing this ciphertext, the buyer will store it (and a Merkle
hash of it, for use with the arbiter in case the protocol
goes wrong later on) and send back a contract, which
consists of a hash of the seller’s file and some session
information. The seller will check this contract and, if
satisfied with the details of the agreement, send back his
decryption key. The buyer can then use this key to de-
crypt the ciphertext received in the second message of
the protocol. If the decryption is successful, the buyer
will send back the endorsement on the coin. If in these
last steps either party is unsatisfied (for example, the file
does not decrypt or the endorsement isn’t valid for the
coin from the setup message), they can proceed to con-
tact the arbiter and run resolution protocols [9].


Figure 3: This figure provides outlines of both our buy and
barter protocols [9, 51]. Until the decision to buy or barter,
the two protocols are identical; the main difference is that in
a buy protocol, the setup message must be sent for each file
exchange, which results in a linear efficiency loss as compared
to bartering.
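
The message order of the modified buy protocol can be summarized as a small state machine. The enum and function names are hypothetical and are not part of the library's API. The rule that only failures in the last two steps lead to the arbiter is an assumption drawn from the wording above; how earlier failures are handled is not specified here, so the sketch simply treats them as an abort.

```cpp
// Messages of the modified buy protocol, in the order they are sent.
enum class BuyStep { Setup, EncryptedFile, Contract, DecryptionKey, Endorsement, Done };

enum class Party { Buyer, Seller, None };

enum class Outcome { Continue, Abort, ContactArbiter };

// Which party sends the message for a given step.
Party sender(BuyStep step) {
    switch (step) {
        case BuyStep::Setup:         return Party::Buyer;   // unendorsed coin + escrow
        case BuyStep::EncryptedFile: return Party::Seller;  // ciphertext of the file
        case BuyStep::Contract:      return Party::Buyer;   // file hash + session info
        case BuyStep::DecryptionKey: return Party::Seller;
        case BuyStep::Endorsement:   return Party::Buyer;
        case BuyStep::Done:          return Party::None;
    }
    return Party::None;
}

// The step that follows a successfully checked message.
BuyStep next(BuyStep step) {
    switch (step) {
        case BuyStep::Setup:         return BuyStep::EncryptedFile;
        case BuyStep::EncryptedFile: return BuyStep::Contract;
        case BuyStep::Contract:      return BuyStep::DecryptionKey;
        case BuyStep::DecryptionKey: return BuyStep::Endorsement;
        default:                     return BuyStep::Done;
    }
}

// What the receiving party does after checking a message.
// Assumption: only a bad decryption key or a bad endorsement goes to the arbiter.
Outcome on_check(BuyStep received, bool check_passed) {
    if (check_passed) return Outcome::Continue;
    if (received == BuyStep::DecryptionKey || received == BuyStep::Endorsement)
        return Outcome::ContactArbiter;
    return Outcome::Abort;
}
```

Walking next() from BuyStep::Setup visits the five messages in order, alternating between buyer and seller.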


6.2.2 Bartering 


This protocol is depicted on the right in Figure 3; be- 
cause the first two messages of the barter protocol (the 
setup message and the encrypted data) are identical to 
those in the buy protocol described in the previous sec- 
tion, we do not describe them again here and instead 
jump directly to the third message. Because bartering 
involves an exchange of data, the initiator will respond 
to the receipt of the ciphertext with a ciphertext of her 
own, corresponding to an encryption of her file. She will 
also send a contract, which is similar to the buy contract
but also contains hash information for her file. The re-
sponder will then check this contract as the seller did in
the buy protocol, and if satisfied with the agreement will
send back his decryption key. If the ciphertext decrypts
correctly (i.e., decrypts to the file described in the con-
tract) then the initiator can respond in turn with her own
decryption key. If this decryption key is also valid, both
parties have successfully obtained the desired files and
the barter protocol can be considered complete. If nei-
ther party had to contact the arbiter (for similar reasons
as in the buy protocol; i.e., a file did not decrypt cor-
rectly) then they are free to engage in future barter proto-
cols without the overhead of an additional setup message.
Otherwise, they need to resolve with the arbiter [51].


6.3 Library performance


In Table 2, we can see the computation time and size 
complexity for the steps described above, as well as com- 
putation and communication overhead for the withdraw 
and deposit protocols involving the bank. The numbers 
in the table were computed on the same computer as 
those in Section 5.4. 


The numbers in Table 2 clearly demonstrate our earlier 
observation that bartering is considerably more efficient 
than buying, both in terms of computation and commu-
nication overhead. The setup message for both buying
and bartering takes about 600ms to generate and occupies
approximately 46kB. In contrast, the rest of the barter
protocol takes very little time: on the order of millisec-
onds for both parties (and about 1.5kB of total overhead).


In addition, we consider the same protocols run us- 
ing a previous “naive” version of our library, which pro- 
vided the same e-cash API and employed some multi- 
exponentation optimizations, but did not use ZKPDL. 
Using the optimizations available to the interpreter is 
considerably faster over our previous approach, mean- 
ing that our interpreter has not only made developing our 
protocols more convenient, but has also helped to 1m- 
prove efficiency. 


7 Related work 


Similar to our approach, FairPlayMP [13] (and its pre- 
decessor, FairPlay [57]) provides a language-based sys- 
tem for secure multi-party computation, allowing multi- 
ple parties to jointly compute a function on private inputs 
while revealing nothing but the resulting value. At the 
heart of FairPlayMP is a programming language, SFDL 
2.0 (short for Secure Function Definition Language), that 
allows programmers to specify a multi-party computa- 
tion. The authors provide a compiler that transforms 
SFDL programs into boolean circuits, and an engine that 
securely evaluates these circuits and distributes the
resulting values among the involved parties. Although
this is a very useful tool, it uses generic circuit
techniques, and thus from an efficiency standpoint it is often
desirable to instead develop a multi-party computation
scheme specific to the intended application.


Stage                            Time (ms)   Naive time (ms)   Size (B)
Withdraw (user)                  126.35      290.79            20093
Withdraw (bank)                  83.36       140.02            1167
Deposit (bank)                   82.11       128.36            22526
Buying a block (buyer)
Buying a block (seller)
Barter setup message
Checking setup message
Barter after setup (initiator)
Barter after setup (responder)
                                 628.49      901.04            47286
                                 211.89      275.94            203

Table 2: Average time required and network overhead, in milliseconds and bytes respectively, for each stage in our e-cash
implementation. The timings were averaged over twenty runs, and caching and compression optimizations were used. For the naive
timings, an older version of the library was used, which uses some multi-exponentiation optimization techniques but not the
interpreter; we can see a clear improvement when using ZKPDL. Parameters were used to provide a security level of 80 bits (160-bit
SHA-1 hashing, 128-bit AES encryption, 1024-bit RSA moduli, and 1024-bit DSA signatures).

IBM’s Idemix (identity mixer) project [19, 14] has 
independently developed a library for zero-knowledge 
proofs and anonymous credentials using Java. Idemix 
has focused on supporting the deployment of anonymous 
credentials in privacy-preserving identity systems, and 
provides a system for obtaining, proving, and verifying 
credentials using XML messages. The Idemix team has 
also developed a high-level language for zero-knowledge
protocols, and describes a proof-of-concept compiler that
can output Java or LaTeX code from these descriptions [5,
4]. However, they do not provide performance
benchmarks, and many implementation details are left as
future work; neither the language nor the compiler appears
in the released Idemix library. While Idemix and our 
work both provide implementations of anonymous cre- 
dentials and CL signatures, in contrast, our focus on ef- 
ficient, repeated execution of e-cash transactions has led 
us to pursue an alternate language-based strategy and de- 
velop a performance-optimized interpreter engine. We 
believe our runtime engine, ease of extensibility, and per- 
formance optimizations provide greater support to cryp- 
tographic researchers and systems programmers seeking 
a framework for deploying zero-knowledge protocols. 

There are also compilers available [15, 6] for the gen- 
eration of proofs of security and correctness for crypto- 
graphic protocols. While this is an interesting and impor- 
tant area of research, these tools largely focus on static 
analysis of protocols rather than performance. Perhaps 
more similar to our work, the languages Cryptol [53] 
and Stupid [52] provide a simple interface for develop- 
ing low-level implementations of cryptographic primi- 
tives (such as hash functions) which can then be analyzed 
and translated into native code on different platforms. 
8 Conclusions and Future Work


In this paper we have introduced a language for gener- 
ating (and verifying) widely-used zero-knowledge proofs 
of knowledge. Through sample programs, we have 
demonstrated how our language is used to express ad- 
vanced cryptographic primitives such as blind signatures, 
verifiable encryption, and endorsed e-cash. We presented 
optimizations provided by our language’s interpreter and 
have shown they provide significant benefit. 


Atop our language framework, we built a library that 
provides optimistic fair exchange protocols based on 
electronic cash. We have further presented optimizations 
for the protocols provided by Cashlib and argued for their 
practicality in network-based applications. 


Much future work is possible for the ZKPDL lan- 
guage and interpreter. There are many other cryp- 
tographic primitives which could be incorporated into 
the language (e.g., encryption, signatures, hash func- 
tions), and other zero-knowledge protocols that could be 
added as relations (e.g., alternate and “fuzzy” schemes 
for range proofs). Incorporating these primitives, per- 
haps by allowing for subroutines and the composabil- 
ity of ZKPDL programs, would allow our library to be 
more easily extended and potentially have applicability 
to a broader range of secure systems. The analysis of 
ZKPDL programs—e.g., to automatically verify proto- 
cols and identify security errors through type analysis or 
formal verification techniques—provides another inter- 
esting area of study. 


For increased performance on multicore architectures, 
we are working on analyzing dependencies among the 
expressions evaluated by our interpreter. The simplic- 
ity of our language, e.g., in compute blocks, allows a 
coarse-grained approach, as the only dependencies that 
arise between lines of ZKPDL are from variables which 
have been declared and assigned in previous lines. 

Finally, in terms of extending Cashlib, to improve a
bank’s efficiency it might also be possible to speed up 
coin verification time by supporting batch verification 
techniques [22, 39] for CL signatures; we leave this as 
one of many interesting open problems. 
Abstract


In this paper we introduce a framework for privacy- 
preserving distributed computation that is practical for 
many real-world applications. The framework is called 
Peers for Privacy (P4P) and features a novel heteroge- 
neous architecture and a number of efficient tools for 
performing private computation and ensuring security at 
large scale. It maintains the following properties: (1) 
Provably strong privacy; (2) Adequate efficiency at rea- 
sonably large scale; and (3) Robustness against realis- 
tic adversaries. The framework gains its practicality by 
decomposing data mining algorithms into a sequence of 
vector addition steps that can be privately evaluated
using a new verifiable secret sharing (VSS) scheme over a
small field (e.g., 32 or 64 bits), which has the same cost
as regular, non-private arithmetic. This paradigm sup- 
ports a large number of statistical learning algorithms in- 
cluding SVD, PCA, k-means, ID3, EM-based machine 
learning algorithms, etc., and all algorithms in the sta- 
tistical query model [36]. As a concrete example, we 
show how singular value decomposition (SVD), which 
is an extremely useful algorithm and the core of many 
data mining tasks, can be done efficiently with privacy 
in P4P. Using real-world data and an actual implementation,
we demonstrate that P4P is orders of magnitude faster 
than existing solutions. 


1 Introduction 


Imagine the scenario where a large group of users want
to mine their collective data. This could be a community
of movie fans extracting recommendations from their
ratings, or a social network voting for their favorite
members. In all these cases, the users may wish not to reveal
their private data, not even to a “trusted” service provider,
but still obtain verifiably accurate results. The major
issues that make this kind of task challenging are the
scale of the problem and the need to deal with cheating
users. Typically the quality of the result increases
with the size of the data (both the size of the user group
and the dimensionality of per-user data). Nowadays it
is common for commercial service providers to run
algorithms on data sets collected from thousands or even
millions of users. For example, the well-publicized
Netflix Prize (http://www.netflixprize.com/) data set consists
of roughly 100M ratings of 17,770 movies contributed
by 480K users. At such a scale, both private computation
and verifying proper behavior become impractical
(more on this later). In other words, privacy technologies fail
to catch up with data mining algorithms’ appetite and
processing capability for large data sets.


We strive to change this. Our goal is to provide a pri- 
vacy solution that is practical for many (but not all) real- 
world applications at reasonably large scale. We intro- 
duce a framework called Peers for Privacy (P4P) which 
is guided by the natural incentives of users/vendors and 
today’s computing reality. On a typical computer today
there is a difference of six orders of magnitude between the
crypto operations in a large field needed for secure
homomorphic computation (order of milliseconds) and regular
arithmetic operations in small (32- or 64-bit) fields
(fraction of a nanosecond). Existing privacy solutions
such as [11, 29] make heavy use of public-key operations
for information hiding or verification. While they have
the same asymptotic complexity as the standard
algorithms for those problems, the constant factors imposed
by public-key operations are prohibitive for large-scale
systems. We show in section 3.3 and section 7.2 that
they cannot be fixed with trivial changes to support
applications at our scale. In contrast, P4P’s main computation
is based on verifiable secret sharing (VSS) over
a small field. This allows private arithmetic operations
to have the same cost as regular, non-private arithmetic
since both manipulate same-sized numbers with
similar complexity. Moreover, such a paradigm admits
extremely efficient zero-knowledge (ZK) tools that are
practical even at large scale. Such tools are indispensable
in dealing with cheating participants.

Some of the techniques used in P4P were initially
introduced in [21]. However, the focus of [21] is to develop
an efficient zero-knowledge proof (ZKP) (for detecting 
cheating users) and prove its effectiveness. It leaves open 
how the ZKP should be incorporated into the computa- 
tion to force proper behavior. As we will show, this is not 
trivial and requires additional tools, probably tailored to 
each application. In particular, [21] does not deal with 
the threat of cheating users changing their data during 
the computation. This could cause the computation to 
produce incorrect results. Such practical issues are not 
addressed in [21]. 

We fill in the missing pieces and provide a comprehen- 
sive solution. The contributions of this paper are: (1) We 
identify three key qualifications a practical privacy solu- 
tion must possess, examine them in light of the changes 
in large-scale distributed computing, and formulate our 
design. The analysis not only provides rationales for our 
scheme, but also can serve as a guideline for practitioners 
to appraise the cost of obtaining privacy in their
applications. (2) We introduce a new ZK protocol that
verifies the consistency of users’ data during the
computation. This protocol complements the work of [21] and
ensures the correctness of the computation in the
presence of active user cheating. (3) We demonstrate the
practicality of the framework with a concrete example,
a private singular value decomposition (SVD) protocol.
Prior to our work, there was no privacy solution providing
comparable performance at such large scales. The
example also serves as a tutorial showing how the frame- 
work can be adapted to different applications. (4) We 
have implemented the framework and performed evalu- 
ations against alternative privacy solutions on real-world 
data. Our experiments show a dramatic performance im- 
provement. Furthermore, we have made the code freely 
available and are continuing to improve it. We believe 
that, like other secure computation implementations such 
as [46, 39, 5, 40], P4P is a very useful tool for devel- 
oping privacy-preserving systems and represents a sig- 
nificant step towards making privacy a practical goal in 
real-world applications. 


2 Preliminaries 


We say that an adversary is passive, or semi-honest, if
it tries to compute additional information about other
players’ data but still follows the protocol. An active,
or malicious, adversary, on the other hand, can deviate
arbitrarily from the protocol, including inputting bogus
data, producing incorrect computation, and aborting the
protocol prematurely. Clearly, active adversaries are much
more difficult to handle than passive ones. Our scheme
is secure against a hybrid threat model that includes both
passive and active adversaries. We introduce the model
in section 4.

The privacy guarantee P4P provides is differential
privacy, a notion of privacy introduced in [25], further
refined by [24, 23], and adopted by many recent works such
as [9, 43, 42, 8, 41]. Differential privacy models the
leakage caused by releasing some function computed over a
data set. It captures the intuition that the function is
private if the risk to one’s privacy does not substantially
increase as a result of participating in the data set. Formally
it is defined as:


Definition 1 (Differential Privacy [25, 24]) $\forall \epsilon, \delta > 0$,
an algorithm $A$ gives $(\epsilon, \delta)$-differential privacy if for all
$S \subseteq \mathrm{Range}(A)$ and for all data sets $D, D'$ such that $D$ and
$D'$ differ by a single record,


$$\Pr[A(D) \in S] \le \exp(\epsilon) \Pr[A(D') \in S] + \delta$$
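
To make the inequality in Definition 1 concrete, the following is a minimal sketch that checks it by brute force for a toy mechanism. Randomized response on a one-record data set is used purely as an illustration and is not part of P4P; the function names and the truth-telling probability `p` are assumptions introduced here.

```python
import math
from itertools import chain, combinations

def randomized_response_dist(bit, p):
    """Output distribution of a toy mechanism A on a one-record data set:
    report the true bit with probability p, the flipped bit otherwise."""
    return {bit: p, 1 - bit: 1.0 - p}

def satisfies_bound(dist_d, dist_d2, eps, delta, tol=1e-12):
    """Check Pr[A(D) in S] <= exp(eps) * Pr[A(D') in S] + delta for every S."""
    outcomes = sorted(set(dist_d) | set(dist_d2))
    subsets = chain.from_iterable(
        combinations(outcomes, r) for r in range(len(outcomes) + 1))
    for s in subsets:
        pr_d = sum(dist_d.get(o, 0.0) for o in s)
        pr_d2 = sum(dist_d2.get(o, 0.0) for o in s)
        if pr_d > math.exp(eps) * pr_d2 + delta + tol:
            return False
    return True

def is_differentially_private(p, eps, delta=0.0):
    """The two data sets {0} and {1} differ by a single record, so the
    bound has to hold in both directions."""
    d0 = randomized_response_dist(0, p)
    d1 = randomized_response_dist(1, p)
    return (satisfies_bound(d0, d1, eps, delta)
            and satisfies_bound(d1, d0, eps, delta))
```

With `p = 0.75` the tightest admissible value is $\epsilon = \ln 3$ when $\delta = 0$; a smaller $\epsilon$ only passes if $\delta$ absorbs the gap, which is the trade-off the two parameters express.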


There are several solutions achieving differential privacy 
for some machine learning and data mining algorithms 
(e.g., [24, 9, 43, 42, 8, 41]). Most require a trusted server 
hosting the entire data set. Our scheme removes such a 
requirement and also provides tools for handling a more 
adversarial setting where the data sources may be mali- 
cious. [4] is also a distributed and differentially private 
scheme for binary sum functions but it is only secure in 
a semi-honest model. 

Differential privacy is widely used in the database pri- 
vacy community to model the leakage caused by answer- 
ing queries. P4P’s reliance on differential privacy is as 
follows: During the computation, certain aggregate in- 
formation (including the final result) is released (other 
information is kept hidden using cryptographic means). 
This is also modeled as query responses computed over 
the entire data set. Measuring such leakage against dif- 
ferential privacy allows us to have a rigorous formulation 
of the risk each individual user faces. By tuning the
parameters $\epsilon$ and $\delta$ we can control such risk and obtain a
system with adequate privacy as well as high efficiency.
Another nice property of differential privacy is that
it can cover the final results (in contrast, secure MPC in
cryptography does not), so the protection is
complete. Integrating differential privacy into secure
computation has been accepted by the cryptography community
[4], and our work can be seen as a concrete and highly
efficient instantiation of such an approach to secure
computation of some algorithms.


3 Design Considerations 


Our design was motivated by careful evaluation of goals, 
available resources, and alternative solutions. 
3.1 Design Goals


Our goal is to provide practical privacy solutions for 
some real-world applications. To this end, we identify 
three properties that are essential to a practical privacy 
solution: 


1. Provable Privacy: Its privacy must be rigorously 
proven against well formulated privacy definitions. 


2. Efficiency and Scalability: It must have adequate
efficiency at reasonably large scale, which is an
absolute necessity for many of today’s data mining
applications. The scale we are targeting is
unprecedented: to support real-world applications, both the
number of users and the number of data items per
user are assumed to be in the millions.


3. Robustness: It must be secure against realistic ad- 
versaries. Many computations either involve the 
participation of users, or collect data from them. 
Cheating of a small number of users is a realistic 
threat that the system must handle. 


To the best of our knowledge, no existing work, or
trivial composition of existing works, attains all three. Ours is the first,
with open-source code, supporting all these properties.


3.2 Available Resources 


During the past few years the landscape of large-scale 
distributed computing has changed dramatically. Many 
new resources and paradigms are available at very low 
cost and many computations that were infeasible at large 
scale in the past are now running routinely. One notable 
trend is the rapid growth of “cloud computing”, which 
refers to the model where clients purchase computing cy- 
cles and/or storage from a third-party provider over the 
Internet. Vendors are sharing their infrastructures and 
allowing general users access to gigantic computing ca- 
pability. Industrial giants such as Microsoft, IBM, Ya- 
hoo!, and Google are all key players in the game. Some 
of the cloud services (e.g., Amazon’s Elastic Compute
Cloud, http://aws.amazon.com/ec2.) are already
available to the general public at a very low price.

The growth of cloud computing symbolizes the in- 
creased availability of large-scale computing power. We 
believe it is time to re-think the issue of
privacy-preserving data mining in light of such changes. There are
several significant differences:


1. Cloud computing providers have very different
incentives. Unlike traditional e-commerce vendors
who are naturally interested in users’ data (e.g.,
purchase history), the cloud computing providers’
commodity (CPU cycles and disk space) is orthogonal
to users’ computation. Providers do not benefit
directly from knowing the data or computation
results, other than ensuring that they are correct.


2. The traditional image of the client-server paradigm has
changed. In particular, the users have much more
control over the data and the computation. In fact, in
many cases the cloud servers will be running code
written by the customers. This is to be contrasted
with traditional e-commerce, where there is a
tremendous power imbalance between the service provider,
who possesses all the information and controls what
computation to perform, and the client users.


3. The servers are now clusters of hundreds or even
thousands of machines capable of handling huge
amounts of data. They are no longer bottlenecks.


Discrepancy of incentives and power imbalance have 
been identified as two major obstacles for the adoption 
of privacy technology by researchers examining privacy 
issues from legal and economic perspectives [26, 1]. In- 
terestingly, both are greatly mitigated with the dawn of 
cloud computing. While traditional e-commerce ven- 
dors are reluctant to adopt privacy technologies, cloud 
providers would happily comply with customers’
instructions regarding what computation to perform. And once
a treasure for the traditional e-commerce vendors, user 
data is now almost a burden for the cloud computing 
providers: storing the data not only costs disk space, but 
also may entail certain liability such as hosting illegal in- 
formation. Some cloud providers may even choose not to 
store the data. For example, with Amazon’s EC2 service, 
user data only persists during the computation. 

We believe that cloud computing offers an extremely 
valuable opportunity for developing a new paradigm of 
practical privacy-preserving distributed computation: the 
existence of highly available, highly reputable, legally
bound service providers also provides a very important
source of security. In particular, they make it realistic
to treat some participants as passive adversaries. (The
rest are still handled as active adversaries. The model
is therefore a heterogeneous one.) By tapping into this
resource, we can build a heterogeneous system that can
have privacy, scalability and robustness all at once.


3.3 The Alternatives


Existing privacy solutions for distributed data mining can 
be classified into two models: distributed and server- 
based. The former is represented by a large amount of 
work in the area of secure multiparty computation (MPC) 
in cryptography. The latter includes mostly homomor- 
phic encryption-based schemes such as [11, 22, 51]. 
Generic MPC: MPC allows n players to compute a
function over their collective data without compromising
the privacy of their inputs or the correctness of the
outputs even when some players are corrupted by the same
adversary. The problem dates back to Yao [52] and
Goldreich et al. [31], and has been extensively studied in
cryptography [6, 2, 33]. Recent years have seen significant
improvements in efficiency. Some protocols achieve
nearly optimal asymptotic complexity [3, 16] while some
work over small fields [12].

From a practitioner’s perspective, however, these
generic MPC protocols are mostly of theoretical interest.
Reducing asymptotic complexity does not automatically
make the schemes practical. These schemes tend to be
complex, which imposes a huge barrier for developers not
familiar with this area. Trying to support generic
computation, most of them compile an algorithm into a (boolean
or arithmetic) circuit. Not only can the depth of such a
circuit be huge for complex algorithms, it is also very
difficult, if not entirely impossible, to incorporate
existing infrastructures and tools (e.g., ARPACK, LAPACK,
MapReduce, etc.) into such computation. These tools
are an indispensable part of our daily computing life and
represent the work of many talented people over many years.
Re-building production-ready implementations is costly
and error-prone and generally not an option for most
companies in our fast-paced modern world.

Recently, several systems have implemented
some of the MPC protocols. While this reflects a
plausible attempt to bridge the gap between theory and
practice, unfortunately, performance-wise none of the
systems came close to providing satisfactory solutions for
most large-scale real-world applications. Table 1 shows
some representative benchmarks obtained by these
implementations. Using FairplayMP [5] as an example,
adding two 64-bit integers is compiled into a circuit of
628 gates and 756 wires using its SFDL compiler.
According to the benchmark in [5], evaluating such a circuit
between two players takes about 7 seconds. With this
performance, adding $10^6$ vectors of dimensionality $10^6$
each, which constitutes one iteration in our framework,
takes $7 \times 10^{12}$ seconds, or 221,969 years.
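
The estimate above can be reproduced with a short worked calculation. This sketch assumes one circuit evaluation per element addition, a flat 7 seconds per evaluation, and a 365-day year; these assumptions are ours and are chosen because they reproduce the quoted figure.

```latex
\begin{align*}
\text{element additions per iteration} &= 10^{6} \times 10^{6} = 10^{12}, \\
\text{total time} &\approx 7\,\text{s} \times 10^{12} = 7 \times 10^{12}\,\text{s}, \\
\text{in years} &\approx \frac{7 \times 10^{12}\,\text{s}}{365 \times 24 \times 3600\,\text{s/year}}
  \approx 221{,}969\ \text{years}.
\end{align*}
```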


ECC and a single server: It has been shown that the
conventional client-server paradigm can be augmented with
homomorphic encryption to perform some computations
with privacy (e.g., [11, 22, 51]). Still, such schemes are
only marginally feasible for small to medium scale
problems due to the need to perform at least a linear number of
large field operations even in a purely semi-honest model.
Using elliptic curve cryptography (ECC) can mitigate the
problem as ECC can reduce the size of the cryptographic
field (e.g., a 160-bit ECC key provides the same level
of security as a 1024-bit RSA key). ECC cryptosystems
such as [44] are (+, +)-homomorphic, which is ideal for
private computation. However, ECC point addition
requires 1 field inversion and several field multiplications.
The operation is still orders of magnitude slower than
adding 64-bit or 32-bit integers directly. According to
our benchmark, inversion and multiplication in a 160-bit
field take 0.0224 and 0.001 milliseconds, respectively.
Adding 1 million $10^6$-element vectors takes 260 days.
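
The 260-day figure can be sanity-checked with a back-of-the-envelope script. This is only a sketch: the two timings come from the text, while the function name, the `mults_per_add` parameter, and the choice to count only the single inversion by default are assumptions made here, since the text does not say how many multiplications were charged.

```python
INVERSION_MS = 0.0224       # from the text: one inversion in a 160-bit field
MULTIPLICATION_MS = 0.001   # from the text: one multiplication in a 160-bit field

def ecc_sum_days(num_vectors, dim, mults_per_add=0):
    """Days needed to add num_vectors vectors of length dim when every
    element addition costs one ECC point addition (one inversion plus
    mults_per_add multiplications)."""
    per_add_ms = INVERSION_MS + mults_per_add * MULTIPLICATION_MS
    total_ms = num_vectors * dim * per_add_ms
    return total_ms / 1000.0 / 86400.0

if __name__ == "__main__":
    print(f"inversion only:     {ecc_sum_days(10**6, 10**6):.1f} days")
    print(f"plus 2 multiplies:  {ecc_sum_days(10**6, 10**6, 2):.1f} days")
```

Counting the inversion alone already gives roughly 259 days, in line with the figure quoted above; any multiplications charged per point addition only push the total higher.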


Lesson learned: For large-scale problems, privacy and 
security must be added with negligible cost. In particular, 
those steps that dominate the computation should not be 
burdened with public-key cryptographic operations (even 
those “efficient” ones such as ECC) simply because they 
have to be performed so many times. This is the major 
principle that guides our design. In our scheme, the main
computation is always performed in a small field, while
verifications are done via random projection techniques
to reduce the number of cryptographic operations. As
our experiments show, this approach is effective. When
the number of cryptographic operations is insignificant,
even using the traditional ElGamal encryption (or
commitment) with a 1024-bit key the performance is adequate
for large-scale problems.


4 P4P’s Architecture 


Our approach is called Peers for Privacy, or P4P. The 
name comes from the feature that, during the computa- 
tion, certain aggregate information is released. This is a 
very important technique that allows the private protocol 
to have high efficiency. We show that publishing such 
aggregate information does not harm privacy: individual 
traits are masked out in the aggregates and releasing them 
is safe. In other words, peers’ data mutually protect each
other within the aggregates.

Let $\kappa > 1$ be a small integer. We assume
that there are $\kappa$ servers belonging to different
service providers (e.g., Amazon’s EC2 service
and Microsoft’s Azure Services Platform,
http://www.microsoft.com/azure/default.mspx). We
define a server as all the computation units under the
control of a single entity. It can be a cluster of thousands
of machines so that it has the capability to support a
large number of users.


Threat Model Let $\alpha \in [0, 0.5)$ be the upper bound on
the fraction of the dishonest users in the system. Our
scheme is robust against a computationally bounded
adversary whose capability of corrupting parties is
modeled as follows:


1. The adversary may actively corrupt at most $\lfloor \alpha n \rfloor$
users where $n$ is the number of users.


2. In addition to 1, we also allow the same adversary
to passively corrupt $\kappa - 1$ server(s).
Table 1: Performance Comparison of Existing MPC Implementations


System            Adversary Model   Run Time (sec)
Fairplay [40]
FairplayMP [5]                      6.25
PSSW [46]
LPS [39]                            135

This model was proposed in [21] and is a special 
case of the general adversary structure introduced in 
[28, 34, 35] in that some of the participants are actively 
corrupted while some others are passively corrupted by 
the same adversary at the same time. Our model does not 
satisfy the feasibility requirements of [34, 35] and [28]. 
We avoid the impossibility by considering addition-only
computation.


The model captures realistic threats in our target
applications. In general, users are not trustworthy. Some may
be incentivized to bias the computation; some may have
their machines corrupted. So we model them as active
adversaries, and our protocol ensures that active
cheating by a small number of users will not exert a large
influence on the computation. This greatly improves over
existing privacy-preserving data mining solutions (e.g.,
[38, 51, 49]) and many current MPC implementations,
which handle only purely passive adversaries. The servers,
on the other hand, are selling CPU cycles and disk space,
something that is not related to users’ computation or
data. Deviating from the protocol incurs a penalty
(e.g., loss of revenue for incorrect results) but brings little
benefit. Their threat is therefore passive. (Corrupted servers
are allowed to share data with corrupted users.)


Treating “large institutional” servers as semi-honest and
non-colluding has already been established by various
previous works [38, 51, 50, 49]. However, in most of
the models, the servers are not only semi-honest, but
also “trusted”, in that some user data is exposed to at
least one of the servers (vertically or horizontally partitioned
database). Our model does not have this type of trust
requirement as each server only holds a random share of
the user data. This further reduces the server’s incentive
to try to benefit from user data (e.g., reselling it) because
the information it has is just random numbers without
the other shares. A compromise requires the collusion
of all servers, which is a much more difficult endeavor.
This also works to the servers’ benefit: they are relieved
of the liability of hosting secret or illegal computation,
a problem that some [18] envision cloud providers
will be facing.
5 The P4P Framework


Let $n$ be the number of users. Let $\phi$ be a small (e.g.,
32- or 64-bit) integer. We write $\mathbb{Z}_\phi$ for the additive group
of integers modulo $\phi$. Let $a_i$ be private user data for
user $i$ and $I$ be public information. Both can be
matrices of arbitrary dimensions with elements from arbitrary
domains. Our scheme supports any iterative algorithm
whose $(t+1)$-th update can be expressed as


$$I^{(t+1)} = f\left(\sum_{i=1}^{n} d_i^{(t)},\; I^{(t)}\right)$$


where $d_i^{(t)} = g(a_i, I^{(t)}) \in \mathbb{Z}_\phi^m$ is an $m$-dimensional
data vector for user $i$ computed locally. Typical values
for both $m$ and $n$ can range from thousands to millions.
Both $f$ and $g$ are in general non-linear. In the SVD
example that we will present, $I^{(t)}$ is the vector returned by
ARPACK, $g$ is matrix-vector product, and $f$ is the
internal computation performed by ARPACK.

This simple primitive is a surprisingly powerful model
supporting a large number of popular data mining and
machine learning algorithms, including Linear
Regression, Naive Bayes, PCA, k-means, ID3, and EM,
as has been demonstrated by numerous previous works
such as [11, 13, 17, 10, 22]. It has been shown that all
algorithms in the statistical query model [36] can be
expressed in this form. Moreover, addition is extremely
easy to parallelize, so aggregating a large quantity of
numbers on a cluster is straightforward.
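
As an illustration of the update form, the sketch below writes plain power iteration for the dominant right singular vector in the "local vector, global sum, update" shape. It is a simplified, non-private stand-in: the paper's SVD example relies on ARPACK rather than power iteration, the arithmetic here is over floats rather than $\mathbb{Z}_\phi$, and the function names `local_contribution`, `update`, and `iterate` are hypothetical.

```python
import math

def local_contribution(a_i, state):
    """Plays the role of g: user i computes a_i * <a_i, state> locally,
    which is that user's share of the product A^T A * state."""
    dot = sum(x * y for x, y in zip(a_i, state))
    return [x * dot for x in a_i]

def update(total, _state):
    """Plays the role of f: normalise the aggregate to obtain the next
    public iterate (power iteration needs nothing else)."""
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total]

def iterate(user_rows, state, rounds):
    """Repeat: every user computes a local vector, the vectors are summed,
    and the public state is updated from the sum alone."""
    for _ in range(rounds):
        contributions = [local_contribution(a_i, state) for a_i in user_rows]
        total = [sum(column) for column in zip(*contributions)]
        state = update(total, state)
    return state
```

The only step that touches more than one user's data is the summation inside `iterate`, which is exactly the step the framework evaluates privately.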


5.1 Private Computation 


In the following we only describe the protocol for one
iteration since the entire algorithm is simply a
sequential invocation of the same protocol. The superscript is
thus dropped from the notation. For simplicity, we only
describe the protocol for the case of $\kappa = 2$. It is
straightforward to extend it to support $\kappa > 2$ servers (by
substituting the (2, 2)-threshold secret sharing scheme with
a $(\kappa, \kappa)$ one). Using more servers strengthens the
privacy protection but also incurs additional cost. We do
not expect the scheme will be used with a large number
of servers. This arrangement simplifies matters such as
synchronization and agreement. Let $S_1$ and $S_2$ denote
the two servers. Leaving out the validity and consistency
checks, which will be illustrated using the SVD example,
the basic computation is carried out as follows:


1. User $i$ generates a uniformly random vector
$u_i \in \mathbb{Z}_\phi^m$ and computes $v_i = d_i - u_i \bmod \phi$. She sends
$u_i$ to $S_1$ and $v_i$ to $S_2$.


2. $S_1$ computes $u = \sum_{i=1}^{n} u_i \bmod \phi$ and $S_2$
computes $v = \sum_{i=1}^{n} v_i \bmod \phi$. $S_2$ sends $v$ to $S_1$.


3. $S_1$ updates $I$ with $f((u + v) \bmod \phi, I)$.


It is straightforward to verify that if both servers follow
the protocol, then the final result is indeed the sum of the
user data vectors mod $\phi$. This result will be correct if
every user’s vector lies within the specified bound on its
L2-norm, which is checked by the ZKP in [21].
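
The three steps above can be sketched in a few lines. This is a minimal single-process simulation, assuming a modulus of $2^{32}$ (the text only says "32- or 64-bit") and omitting the validity and consistency checks as well as all networking; the names `PHI`, `share`, `aggregate`, and `private_sum` are hypothetical.

```python
import secrets

PHI = 2**32  # assumed modulus; the text only specifies a 32- or 64-bit integer

def share(d_i):
    """Step 1: split a user's vector into two additive shares mod PHI.
    u_i is uniformly random, so each share alone reveals nothing about d_i."""
    u_i = [secrets.randbelow(PHI) for _ in d_i]
    v_i = [(d - u) % PHI for d, u in zip(d_i, u_i)]
    return u_i, v_i

def aggregate(shares):
    """Step 2: a server sums, coordinate-wise, the shares it received."""
    return [sum(column) % PHI for column in zip(*shares)]

def private_sum(user_vectors):
    """Steps 1-3 without the update f: returns (u + v) mod PHI, which
    equals the sum of all user vectors mod PHI."""
    pairs = [share(d) for d in user_vectors]
    u = aggregate([p[0] for p in pairs])  # computed by S1
    v = aggregate([p[1] for p in pairs])  # computed by S2, then sent to S1
    return [(a + b) % PHI for a, b in zip(u, v)]
```

For example, `private_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` yields the coordinate-wise sum of the three vectors, reduced mod `PHI`. Each server performs only ordinary modular additions, which is why the private computation costs about the same as the non-private one.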


5.2 Provable Privacy 


Theorem 1 P4P’s computation protocol leaks no
information beyond the intermediate and final aggregates, if
no more than $\kappa - 1$ servers are corrupted.


The proof follows easily from the fact that both the secret
sharing scheme (for the computation) and the Pedersen
commitment scheme [45, 15] used in the ZK protocols are
information-theoretically private, as the adversary’s view of
the protocol is uniformly random and contains no
information about user data. We refer the readers to [30] for
details and a formal definition of information-theoretic
privacy.

As for the leakage caused by the released sums, first, 
for SVD, and some other algorithms, we are able to show 
the sums can be approximated from the final result so 
they do not leak more information. For general compu- 
tation, we draw on the works on differential privacy. [20] 
has shown that, using well-established results from sta- 
tistical database privacy [7, 19, 25], under certain condi- 
tions, releasing the vector sums still maintains differen- 
tial privacy. 

In some situations verifying the conditions of [20] pri- 
vately is non-trivial but this difficulty is not essential in 
our scheme. There are well-established results that prove 
that differential privacy, as well as adequate accuracy, 
can be maintained as long as the sums are perturbed by 
independent noise with variance calibrated to the number 
of iterations and the sensitivity of the function [7, 19, 25]. 
In our setting, it is trivial to introduce noise into our
framework: each server, which is semi-honest, can add the
appropriate amount of noise to its partial sums after all the
vectors from users are aggregated. Calibrating the noise level
is also easy: all one needs are the parameters
$\epsilon, \delta$, the total number of queries ($mT$ in our
case, where $T$ is the number of iterations), and the
sensitivity of the function $f$, which is summation in our
case, defined as

$$S(f) = \max_{D, D'} \| f(D) - f(D') \|_1$$

where $D$ and $D'$ are two data sets differing by a single
record and $\| \cdot \|_1$ denotes the L1-norm of a vector.
Cauchy's Inequality states that

$$\Big(\sum_{i=1}^{m} x_i y_i\Big)^2 \le \Big(\sum_{i=1}^{m} x_i^2\Big)\Big(\sum_{i=1}^{m} y_i^2\Big)$$

For a user vector $a = [a_1, \ldots, a_m]$, let $x_i = |a_i|$,
$y_i = 1$; we have

$$\|a\|_1^2 = \Big(\sum_{i=1}^{m} |a_i|\Big)^2 \le \Big(\sum_{i=1}^{m} a_i^2\Big) m = \|a\|_2^2 \, m$$

Since our framework bounds the L2-norm of a user's vector to
below $L$, this means the sensitivity of the computation is at
most $\sqrt{m} L$.
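
The norm inequality behind this bound is easy to check numerically. The helper names below are ours, chosen for illustration; the code simply evaluates both sides of $\|a\|_1 \le \sqrt{m}\,\|a\|_2$ for a given vector.

```python
import math

def l1_norm(a):
    """Sum of absolute values."""
    return sum(abs(x) for x in a)

def l2_norm(a):
    """Euclidean length."""
    return math.sqrt(sum(x * x for x in a))

def sensitivity_bound(m, L):
    """Upper bound sqrt(m) * L on the L1-norm of any m-vector with L2-norm below L."""
    return math.sqrt(m) * L

def satisfies_bound(a):
    """Check ||a||_1 <= sqrt(m) * ||a||_2, with a small floating-point tolerance."""
    return l1_norm(a) <= sensitivity_bound(len(a), l2_norm(a)) + 1e-12
```

The bound is tight when all coordinates have equal magnitude, for example `[2, 2, 2, 2]`, where both sides equal 8.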

Note that the perturbation does not interfere with our 
ZK verification protocols in any way, as the latter is per- 
formed between each user and the servers on the original 
data. Whether noise is necessary or not is dependent on 
the algorithm. For simplicity we will not describe the 
noise process in our protocol explicitly. We stress again 
that the SVD example we will present next does not need 
any noise at all. See section 6.6. 


6 Private Large-Scale SVD 


In the following we use a concrete example, a private 
SVD scheme, to demonstrate how the P4P framework 
can be used to support private computation of popular 
algorithms. 


6.1 Basics 


Recall that for a matrix $A \in \mathbb{R}^{n \times m}$, there
exists a factorization of the form

$$A = U \Sigma V^T \qquad (1)$$

where $U$ and $V$ are $n \times n$ and $m \times m$,
respectively, and both have orthonormal columns. $\Sigma$ is
$n \times m$ with nonnegative real numbers on the diagonal,
sorted in descending order, and zeros off the diagonal. Such a
factorization is called a singular value decomposition of $A$.
The diagonal entries of $\Sigma$ are called the singular values
of $A$. The columns of $U$ and $V$ are the left- and
right-singular vectors, respectively, for the corresponding
singular values.

SVD is a very powerful technique that forms the core of many
data mining and machine learning algorithms. Let
$r = \mathrm{rank}(A)$ and $u_i, v_i$ be the column vectors of
$U$ and $V$, respectively. Equation 1 can be rewritten as
$A = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$, where
$\sigma_i$ is the $i$th singular value of $A$. Let $k < r$ be
an integer parameter; we can approximate $A$ by
$A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$.
It is known that of all rank-$k$ approximations, $A_k$ is
optimal in the Frobenius norm sense. The $k$ columns of $U_k$
(resp. $V_k$) give the optimal $k$-dimensional approximation to
the column space (resp. row space) of $A$. This dimensionality
reduction preserves the structure of the original data while
considering only the essential components of the matrix. It
usually filters out noise and improves the performance of data
mining tasks.

Our implementation uses a popular eigensolver, ARPACK [37]
(ARnoldi PACKage), and its parallel version PARPACK. ARPACK
consists of a collection of Fortran77 subroutines for solving
large-scale eigenvalue problems. The package implements the
Implicitly Restarted Arnoldi Method (IRAM) and allows one to
compute a few, say $k$, eigenvalues and eigenvectors with
user-specified features such as those of largest magnitude.
Its storage complexity is $n \cdot O(k) + O(k^2)$, where $n$
is the size of the matrix. ARPACK is a freely available yet
powerful tool. It is best suited for applications whose
matrices are either sparse or not explicitly available: it
only requires the user code to perform some "action" on a
vector, supplied by the solver, at every IRAM iteration. This
action is simply a matrix-vector product in our case. Such a
reverse communication interface works seamlessly with P4P's
aggregation protocol.


6.2 The Private SVD Scheme 


In our setting the rows of $A$ are distributed across all
users. We use $A_{i*} \in \mathbb{R}^m$ to denote the
$m$-dimensional row vector owned by user $i$. From equation 1,
and the fact that both $U$ and $V$ are orthonormal, it is clear
that $A^T A = V \Sigma^2 V^T$, which implies that
$A^T A V = V \Sigma^2$. A straightforward way is then to compute
$A^T A = \sum_{i=1}^{n} A_{i*}^T A_{i*}$ and solve for the
eigenpairs of $A^T A$. The aggregate can be computed using our
private vector addition framework. This is a distributed
version of the method proposed in [7] and does not require the
consistency protocol that we will introduce later.
Unfortunately, this approach is not scalable, as the cost for
each user is $O(m^2)$. Suppose $m = 10^6$ and each element is a
64-bit integer; then $A_{i*}^T A_{i*}$ is $8 \times 10^{12}$
bytes, or about 8 TB. The communication cost for each user is
then 16 TB (she must send shares to two servers). This is a
huge overhead, both communication- and computation-wise.
Usually the data is very sparse and it is common practice to
reduce cost by exploiting the sparsity. Unfortunately, sparsity
does not help in a privacy-respecting application: revealing
which elements are non-zero is a huge privacy breach, and the
users are forced to use the dense format. We propose the
following scheme, which reduces the cost dramatically. We
involve the users in the iteration, and the total communication
(and computation) cost per iteration is only $O(m)$ for each
user. The number of iterations required ranges from tens to
over a thousand. This translates to a maximum of a few GB of
data communicated by each user over the entire protocol, which
is much more manageable.

Figure 1: Private SVD with P4P
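
The size estimates for the naive approach follow directly from the dimensions. The sketch below just redoes that arithmetic; the constant and function names are ours, and decimal units (1 TB = $10^{12}$ bytes) are assumed.

```python
BYTES_PER_ELEMENT = 8  # one 64-bit integer per matrix entry, as in the text
NUM_SERVERS = 2        # each user sends one share to each server

def dense_outer_product_bytes(m):
    """Size of one user's dense m x m matrix A_i^T A_i."""
    return m * m * BYTES_PER_ELEMENT

def naive_upload_bytes(m):
    """Total upload per user when the m x m matrix is secret-shared."""
    return NUM_SERVERS * dense_outer_product_bytes(m)

def per_iteration_share_bytes(m):
    """Size of one m-vector share sent in a single iteration of the proposed scheme."""
    return m * BYTES_PER_ELEMENT
```

For $m = 10^6$ this gives 8 TB for the dense matrix and 16 TB of upload, against 8 MB per share per iteration in the iterative scheme.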

One server, say $S_1$, will host an ARPACK engine and interact
with its reverse communication interface. In our case, since
$A^T A$ is symmetric, the server will use dsaupd, ARPACK's
double precision routine for symmetric problems, and ask for
the $k$ largest (in magnitude) eigenvalues. At each iteration,
dsaupd returns a vector $v$ to the server code and asks for the
matrix-vector product $A^T A v$. Notice that

$$A^T A v = \sum_{i=1}^{n} A_{i*}^T A_{i*} v$$

Each term in the summation is computable by each user locally
in $O(m)$ time (by computing the inner product $A_{i*} \cdot v$
first) and the result is an $m$-vector. The vectors can then be
input to the P4P computation, which aggregates them across all
users privately. The aggregate is the matrix-vector product,
which can be returned to ARPACK for another iteration. This
process is illustrated in figure 1.
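
The decomposition of $A^T A v$ into per-user terms can be illustrated as follows. This is a plain, non-private sketch with function names of our choosing: it shows that summing the $O(m)$ local contributions gives the same vector as forming $A^T A$ explicitly, which is the quantity the private aggregation would compute.

```python
def user_contribution(row, v):
    """One user's term A_i^T (A_i . v), computed in O(m) time."""
    dot = sum(a * x for a, x in zip(row, v))  # inner product first
    return [a * dot for a in row]

def gram_matvec(rows, v):
    """Sum of all user contributions, i.e. A^T A v, without forming A^T A."""
    total = [0] * len(v)
    for row in rows:
        total = [t + c for t, c in zip(total, user_contribution(row, v))]
    return total

def gram_matvec_direct(rows, v):
    """Reference computation that builds the m x m matrix A^T A explicitly."""
    m = len(v)
    gram = [[sum(r[j] * r[k] for r in rows) for k in range(m)] for j in range(m)]
    return [sum(gram[j][k] * v[k] for k in range(m)) for j in range(m)]
```

Both functions agree on any input; only `gram_matvec` avoids the $O(m^2)$ per-user cost.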

The above method is known to have a sensitivity problem, i.e.,
a small perturbation to the input could cause a large error in
the output. In particular, the error is
$O(\|A\|^2 / \sigma_i)$ [48]. Fortunately, most applications
(e.g., PCA) only need the $k$ largest singular values (and
their singular vectors). This is usually not a problem for
those applications, since for the principal components
$O(\|A\|^2 / \sigma_i)$ is small. There is no noticeable
inaccuracy in our test applications (latent semantic analysis
for document retrieval). For general problems the stable way is
to compute the eigenpairs of the matrix

$$H = \begin{pmatrix} 0 & A^T \\ A & 0 \end{pmatrix}$$

It is straightforward to adapt our private vector addition
framework to compute the matrix-vector product with $H$. For
simplicity we will not elaborate on this.


6.3 Enforcing Data Consistency


During the iteration, user $i$ should input
$d_i = A_{i*}^T A_{i*} v$. However, a cheating user could input
something completely different. This threat is different from
inputting bogus (but in the allowable range) data at the
beginning (and using it consistently throughout the
iterations). The latter only introduces noise to the
computation and generally does not affect the convergence. The
L2-norm ZKP introduced in [21], which verifies that the L2-norm
of a user's vector is bounded by a public constant, is
effective in bounding the noise but does not help in enforcing
consistency. The former, on the other hand, may cause the
computation not to converge at all. This is a general problem
for iterative algorithms and involves more than simply testing
the equality of vectors: the task is complicated by the local
function that each user evaluates on her data, i.e., she is not
simply inputting her private data vector, but some (possibly
non-linear) function of it. In the case of SVD, the system
needs to ensure that user $i$ uses the same $A_{i*}$ (to
compute $d_i = A_{i*}^T A_{i*} v$) in all the iterations, not
that she inputs the same vector.

We provide a novel zero-knowledge tool that ensures that the
correct data is used. The protocol is probabilistic and relies
on random projection. That is, the user is asked to project her
original vector and her result of the current round onto some
random direction. The protocol then tests the relation between
the two projections. We will show that this method catches
cheating with high probability but involves only very few
expensive large-field operations.


6.3.1 Tools 


The consistency protocol uses some standard cryptographic
primitives. Detailed constructions and proofs can be found in
[45, 15, 11]. We summarize only their key properties here. All
values used in these primitives lie in the multiplicative group
$\mathbb{Z}_q^*$, or in the additive group of exponents for
this group, where $q$ is a 1024- or 2048-bit prime. They rely
on RSA or discrete log functions for cryptographic protection
of information.


• Homomorphic commitment: A homomorphic commitment to an
integer $a$ with randomness $r$ is written as $C(a, r)$. It is
homomorphic in the sense that
$C(a, r) C(b, s) = C(a + b, r + s)$. It is infeasible to
determine $a$ given $C(a, r)$. We say that a prover "opens" the
commitment if it reveals $a$ and $r$.


• ZKP of knowledge: A prover who knows $a$ and $r$ (i.e., who
knows how to open $A = C(a, r)$) can demonstrate that it has
this knowledge to a verifier who knows only the commitment $A$.
The proof reveals nothing about $a$ or $r$.


• ZKP for equivalence: Let $A = C(a, r)$ and $B = C(a, s)$ be
two commitments to the same value $a$. A prover who knows how
to open $A$ and $B$ can demonstrate to a verifier in zero
knowledge that they commit to the same value.


• ZKP for product: Let $A$, $B$ and $C$ be commitments to $a$,
$b$, $c$ respectively, where $c = ab$. A prover who knows how
to open $A$, $B$, $C$ can prove in zero knowledge to a verifier
who has only the commitments that the relationship $c = ab$
holds among the values they commit to. If, say, $a$ is made
public, this primitive can be used to prove that $C$ encodes a
number that is a multiple of $a$.
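
The homomorphic property can be illustrated with a toy Pedersen-style commitment $C(a, r) = g^a h^r \bmod p$. The parameters below are assumptions chosen only so the arithmetic is easy to follow: they are far too small to be secure, and in a real deployment the discrete log of `H` with respect to `G` must be unknown to the prover.

```python
P = 23  # toy prime modulus (assumption; real deployments use 1024+ bits)
Q = 11  # prime order of the subgroup used for exponents; Q divides P - 1
G = 4   # generator of the order-Q subgroup of Z_P^*
H = 9   # second generator of the same subgroup

def commit(a, r):
    """Toy commitment C(a, r) = G^a * H^r mod P."""
    return (pow(G, a % Q, P) * pow(H, r % Q, P)) % P

def verify_opening(c, a, r):
    """'Opening' a commitment means revealing a and r; the verifier recomputes it."""
    return c == commit(a, r)

def combine(c1, c2):
    """Multiply two commitments; the result commits to the sum of the values."""
    return (c1 * c2) % P
```

Multiplying `commit(3, 5)` by `commit(4, 6)` gives a commitment that opens to the value 7 with randomness 11, which is the property $C(a, r)C(b, s) = C(a + b, r + s)$.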


6.3.2 The Protocol 


The consistency check protocol is summarized in the following.
Since the protocol is identical for all users, we drop the user
subscript for the rest of the paper whenever there is no
confusion. Let $a \in \mathbb{Z}^m$ be a user's original
vector (i.e., her row in the matrix $A$). The correct user
input to this round should be $d = a^T a v$. For two vectors
$x$ and $y$, we use $x \cdot y$ to denote their scalar product.


1. After the user inputs her vector $d$, in the form of two
random vectors $d^{(1)}$ and $d^{(2)}$ in
$\mathbb{Z}_\phi^m$, one to each server, s.t.
$d = d^{(1)} + d^{(2)} \bmod \phi$, $S_1$ broadcasts a random
number $r$. Using $r$ as the seed and a public PRG
(pseudo-random generator), all players generate a random vector
$c \in_R \mathbb{Z}_\phi^m$.


2. For $j \in \{1, 2\}$, the user computes
$x^{(j)} = c \cdot a^{(j)} \bmod \phi$ and
$y^{(j)} = a^{(j)} \cdot v \bmod \phi$. Let
$x = x^{(1)} + x^{(2)}$, $y = y^{(1)} + y^{(2)}$, $z = xy$. Let
$w = (c \cdot a)(a \cdot v) - xy$. The user commits $X^{(j)}$
to $x^{(j)}$, $Y^{(j)}$ to $y^{(j)}$, $Z$ to $z$, and $W$ to
$w$. She also constructs two ZKPs: (1) $W$ encodes a number
that is a multiple of $\phi$. (2) $Z$ encodes a number that is
the product of the two numbers encoded in $X$ and $Y$, where
$X = X^{(1)} X^{(2)}$ and $Y = Y^{(1)} Y^{(2)}$. She sends all
commitments and ZKPs to both servers.


3. The user opens $X^{(j)}$ and $Y^{(j)}$ to $S_j$, who verifies
that both are computed correctly. Both servers verify the ZKPs.
If any of them fails, the user is marked as FAIL and the
servers terminate the protocol with her.


4. For $j \in \{1, 2\}$, the user computes
$\tilde{z}^{(j)} = c \cdot d^{(j)} \bmod \phi$,
$\tilde{z} = \tilde{z}^{(1)} + \tilde{z}^{(2)}$ and
$\tilde{w} = c \cdot d - \tilde{z}$. She commits
$\tilde{Z}^{(1)}$ to $\tilde{z}^{(1)}$, $\tilde{Z}^{(2)}$ to
$\tilde{z}^{(2)}$, and $\tilde{W}$ to $\tilde{w}$. She
constructs the following two ZKPs: (1) $\tilde{W}$ encodes a
number that is a multiple of $\phi$ and (2)
$\tilde{Z}\tilde{W}$ and $ZW$ encode the same value, where
$\tilde{Z} = \tilde{Z}^{(1)} \tilde{Z}^{(2)}$. She sends all
the commitments and ZKPs to both servers.


5. The user opens $\tilde{Z}^{(j)}$ to $S_j$, who verifies that
it is computed correctly. Both servers verify the two ZKPs.
They mark the user as FAIL if any of the verifications fails
and terminate the protocol with her.


6. Both servers output PASS.
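
The integer relations that the six steps enforce can be traced with a small sketch. It models only the arithmetic: commitments and ZKPs are not implemented, the random vector `c` is passed in directly rather than derived from a PRG, and the function names are ours. The final test encodes the observation that a cheating user can shift $w$ or $\tilde{w}$ only by multiples of $\phi$.

```python
import random

def dot(x, y):
    """Scalar product over the integers."""
    return sum(p * q for p, q in zip(x, y))

def split(vec, phi, rng):
    """Additive shares s1, s2 with s1 + s2 = vec (mod phi)."""
    s1 = [rng.randrange(phi) for _ in vec]
    s2 = [(x - r) % phi for x, r in zip(vec, s1)]
    return s1, s2

def consistency_check(a, v, d, c, phi, seed=0):
    """Arithmetic core of steps 2-5; commitments and ZKPs are NOT modeled."""
    rng = random.Random(seed)
    a1, a2 = split(a, phi, rng)
    d1, d2 = split(d, phi, rng)

    # Step 2: project the shares of a onto c and v.
    x = dot(c, a1) % phi + dot(c, a2) % phi
    y = dot(a1, v) % phi + dot(a2, v) % phi
    z = x * y
    w = dot(c, a) * dot(a, v) - z      # always a multiple of phi

    # Step 4: project the shares of d onto c.
    z_t = dot(c, d1) % phi + dot(c, d2) % phi
    w_t = dot(c, d) - z_t              # always a multiple of phi
    assert w % phi == 0 and w_t % phi == 0

    # The equivalence ZKP compares z + w with z_t + w_t. A cheater can only
    # adjust w or w_t by multiples of phi, so she passes iff the gap is one.
    gap = (z_t + w_t) - (z + w)
    return gap % phi == 0
```

With an honest input `d = [a_i * (a . v) for a_i in a]` the gap is exactly zero; changing a single coordinate of `d` makes the check fail unless the change happens to be cancelled modulo $\phi$ by the random projection.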


Group Sizes 

There are three groups/fields involved in the protocol: the
large multiplicative group $\mathbb{Z}_q^*$ used for
commitments and ZKPs, the "small" group $\mathbb{Z}_\phi$ used
for additive secret sharing, and the group of all integers.
All the commitments, such as $X^{(j)}$ and $Y^{(j)}$, are
computed in $\mathbb{Z}_q^*$ so standard cryptographic tools
can be used. The inputs to the commitments, which can be the
user's data or some intermediate results, are either in
$\mathbb{Z}_\phi$ or in the integer group (without bounding
their values). Restricting commitment inputs to a small
field/group does not compromise the security of the scheme,
since the outputs are still in the large field. Using
Pedersen's commitment as an example, the hiding property is
guaranteed by the random numbers that are generated in the
large field for each commitment, and breaking the binding
property is still equivalent to solving the discrete logarithm
problem in $\mathbb{Z}_q^*$. See [45].

The protocol makes it explicit which group a number is in using
the mod $\phi$ operator (i.e., $x = g(y) \bmod \phi$ restricts
$x$ to be in $\mathbb{Z}_\phi$, while $x = g(y)$ means $x$ can
be in the whole integer range). The protocol assumes that
$q \gg \phi$. This ensures that the numbers that are in the
integer group ($x, y, z, w$ in step 2 and $\tilde{z}$ and
$\tilde{w}$ in step 4) are much less than $q$, which avoids
modular reduction when their commitments are produced. This is
true for most realistic deployments, since $\phi$ is typically
64 bits or less while $q$ is 1024 bits or more. Theorem 2
proves that the transition from $\mathbb{Z}_\phi$ to integer
fields and $\mathbb{Z}_q$ only causes the protocol to fail with
extremely low probability:


Theorem 2 Let $O$ be the output of the Consistency Check
protocol. Then

$$\Pr(O = \mathrm{PASS} \mid d = a^T a v) = 1$$

and

$$\Pr(O = \mathrm{PASS} \mid d \ne a^T a v) \le \frac{1}{\phi}$$

Furthermore, the protocol is zero-knowledge.
Proof If computed correctly, both $w$ and $\tilde{w}$ are
multiples of $\phi$ due to modular reduction. Because of
homomorphism, the equivalence ZKP that $\tilde{Z}\tilde{W}$ and
$ZW$ encode the same value verifies that
$c \cdot d = c \cdot (a^T a v)$.

Completeness: If the user performs the computation correctly,
she should input $d = a^T a v$ into this round of computation.
All the verifications should pass. The protocol outputs PASS
with probability 1.

Soundness: Suppose $d \ne a^T a v$. The user is forced to
compute the commitments $X^{(1)}$, $X^{(2)}$, $Y^{(1)}$,
$Y^{(2)}$, and $\tilde{Z}^{(1)}$, $\tilde{Z}^{(2)}$ faithfully,
since she has to open them to at least one of the servers. The
product ZKP at step 2 forces the number encoded in $Z$ to be
$xy$, which differs from $c \cdot (a^T a v)$ by $w$. Due to
homomorphism, at step 4, $\tilde{Z}$ encodes a number that
differs from $c \cdot d$ by $\tilde{w}$. The user could cheat by
lying about $w$ or $\tilde{w}$, i.e., she could encode some
other values in $W$ and $\tilde{W}$ to adjust for the
difference between $c \cdot d$ and $c \cdot (a^T a v)$, hoping
to pass the equivalence ZKP. However, assuming the soundness of
the ZKPs used, the protocol forces both to be multiples of
$\phi$ (steps 2 and 4), so she could succeed only when the
difference between $c \cdot d$, which she actually inputs to
this round, and $c \cdot (a^T a v)$, which she should input, is
some multiple of $\phi$. Since $c$ is made known to her after
she inputs $d$, the two numbers are totally unpredictable and
random to her. The probability that
$c \cdot d - c \cdot (a^T a v)$ is a multiple of $\phi$ is only
$1/\phi$, which is the probability of her success.

Finally, the protocol consists of a sequential invoca- 
tion of some well-established ZKPs. By the sequential 
composition theorem of [32], the whole protocol is also 
zero-knowledge. 

As a side note, all the ZKPs can be made non-interactive using
the Fiat-Shamir paradigm [27]. The user could upload her data
in a batch without further interaction. This makes it easier to
deploy the scheme. The consistency check is also much more
lightweight than the L2-norm ZKP [21]: the number of large
field operations is constant, as opposed to $O(\log m)$ in the
L2-norm ZKP. The private SVD computation thus involves only one
L2-norm ZKP in the first round, and one light verification for
each of the subsequent rounds.


6.4 Dealing with Real Numbers 


In their simplest forms, the cryptographic tools only support
computation on integers. In most domains, however, applications
typically have to handle real numbers. In the case of SVD, even
if the original input matrix contains only integer entries, it
is likely that real numbers appear in the intermediate results
(e.g., the vectors returned by ARPACK) and the final results.

Because of the linearity of the P4P computation, we can use a
simple linear digitization scheme to convert between real
numbers in the application domain and $\mathbb{Z}_\phi$, P4P's
integer field. Let $R > 0$ be the bound of the maximum absolute
value application data can take, i.e., all numbers produced by
the application are in $[-R, R]$. The integer field provides
$|\phi|$ bits of resolution. This means the maximum
quantization error for one variable is
$R/\phi = 2^{|R| - |\phi|}$. Summing across all $n$ users, the
worst case absolute error is bounded by $n 2^{|R| - |\phi|}$.
In practice $|\phi|$ can be 64, and $|R|$ can be around, e.g.,
20 (this gives a range of $[-2^{20}, 2^{20}]$). With
$n = 10^6$, this gives a maximum absolute error of under 1 over
a million.
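
The error bound quoted above can be reproduced with a one-line calculation. The function name and argument names are ours; the formula is the $n\,2^{|R| - |\phi|}$ bound from the text.

```python
def worst_case_error(n_users, r_bits, phi_bits):
    """Worst-case absolute quantization error n * 2^(|R| - |phi|)."""
    return n_users * 2.0 ** (r_bits - phi_bits)

def per_variable_error(r_bits, phi_bits):
    """Maximum quantization error for a single variable, 2^(|R| - |phi|)."""
    return 2.0 ** (r_bits - phi_bits)
```

With $|\phi| = 64$, $|R| = 20$ and $n = 10^6$, `worst_case_error` evaluates to about $5.7 \times 10^{-8}$, comfortably below one millionth.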


6.5 The Protocol 


Let $Q$ be the set of qualified users, initialized to the set
of all users. The entire private SVD method is summarized as
follows:


1. Input The user first provides an L2-norm ZKP [21] on $a$
with a bound $L$, i.e., she submits a ZKP that $\|a\|_2 < L$.
This step also forces the user to commit to the vector $a$.
Specifically, at the end of this step, $S_1$ and $S_2$ have
$a^{(1)} \in \mathbb{Z}_\phi^m$ and
$a^{(2)} \in \mathbb{Z}_\phi^m$, respectively, such that
$a = a^{(1)} + a^{(2)} \bmod \phi$. Users who fail this ZKP are
excluded from subsequent computation.


2. Repeat the following steps until the ARPACK routine
indicates convergence or stops after a certain number of
iterations:


(a) Consistency Check When dsaupd returns control to $S_1$
with a vector, the server converts the vector to
$v \in \mathbb{Z}_\phi^m$ and sends it to all users. The
servers execute the consistency check protocol for each user.


(b) Aggregate For any users who are marked as FAIL, or fail to
respond, the servers simply ignore their data and exclude them
from subsequent computation. $Q$ is updated accordingly. For
this round they compute $s = \sum_{i \in Q} d_i$ and $S_1$
returns it as the matrix-vector product to dsaupd, which runs
another iteration.


3. Output $S_1$ outputs

$$\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k) \in \mathbb{R}^{k \times k}$$
$$V_k^T = [v_1, v_2, \ldots, v_k]^T \in \mathbb{R}^{k \times m}$$

with $\sigma_i = \sqrt{\lambda_i}$, where $\lambda_i$ is the
$i$th eigenvalue and $v_i$ the corresponding eigenvector
computed by ARPACK, $i = 1, \ldots, k$, and
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$.
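
The overall loop can be sketched as follows. Note the substitutions: plain power iteration stands in for ARPACK's IRAM, so only the top singular pair is computed; the sums are taken in the clear rather than through secret sharing; and the ZKPs are replaced by a given set of failed users. The function names and the `failed` parameter are ours. What the sketch does show is the reverse-communication pattern, in which the solver only ever sees $s = \sum_{i \in Q} d_i$.

```python
import math

def user_contribution(row, v):
    """User i's input d_i = A_i^T (A_i . v) for the current vector v."""
    dot = sum(a * x for a, x in zip(row, v))
    return [a * dot for a in row]

def top_singular_pair(rows, iterations=200, failed=frozenset()):
    """Power iteration on A^T A driven only by aggregated user contributions."""
    m = len(rows[0])
    qualified = [i for i in range(len(rows)) if i not in failed]  # the set Q
    v = [1.0 / math.sqrt(m)] * m       # unit-length starting vector
    lam = 0.0
    for _ in range(iterations):
        s = [0.0] * m
        for i in qualified:            # s = sum over i in Q of d_i
            s = [t + c for t, c in zip(s, user_contribution(rows[i], v))]
        lam = math.sqrt(sum(x * x for x in s))  # ||A^T A v|| for unit v
        if lam == 0.0:
            break
        v = [x / lam for x in s]       # normalize for the next round
    return math.sqrt(lam), v           # sigma = sqrt(lambda)
```

Excluding a user through `failed` simply removes her row from every subsequent aggregate, mirroring the update of $Q$ in step 2(b).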


For accuracy of the result produced by this protocol in 
the presence of actively cheating users, we have 
Theorem 3 Let $n_c$ be the number of cheating users. We use
$\tilde{\cdot}$ to denote a perturbed quantity and $\sigma_i$
the $i$-th singular value of matrix $A$. Assuming that honest
users' vector L2-norms are uniformly random in $[0, L)$ and
$n_c < n$, then





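Proof The classic Weyl and Mirsky theorems [47] bound the
perturbation to $A$'s singular values in terms of the Frobenius
norm $\| \cdot \|_F$ of $E := \tilde{A} - A$:

$$\sqrt{\sum_i (\tilde{\sigma}_i - \sigma_i)^2} \le \|E\|_F$$

In our case each row $a_i$ of $A$ is held by a user, so we have

$$\|E\|_F = \sqrt{\sum_{i=1}^{n} \|\tilde{a}_i - a_i\|_2^2}$$

Since the protocol ensures that $\|a_i\|_2 < L$ for all users,

$$\sqrt{\sum_i (\tilde{\sigma}_i - \sigma_i)^2} \le \sqrt{\sum_{i=1}^{n} \|\tilde{a}_i - a_i\|_2^2} < \sqrt{n_c} L$$

Let $\epsilon = \sqrt{\sum_i (\tilde{\sigma}_i - \sigma_i)^2} / \sqrt{\sum_i \sigma_i^2}$, and
assuming that honest users' vector L2-norms are uniformly
random in $[0, L)$ and $n_c < n$, then

$$\epsilon = \frac{\sqrt{\sum_i (\tilde{\sigma}_i - \sigma_i)^2}}{\|A\|_F} < \frac{\sqrt{n_c} L}{0.5 \sqrt{n - n_c} L} \approx 2 \sqrt{\frac{n_c}{n}}$$
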

The scheme is also quite robust against user failures. During
our tests reported in section 7, we simulated a fraction of
random users "dropping out" of each iteration. Even when up to
50% of the users dropped, for all our test sets, the
computation still converged without noticeable loss of
accuracy, measured by residual error (see section 7.1) using
the final matrix with failed users' data ignored. This allows
us to handle malicious users who actively try to disrupt the
computation and those who fail to respond due to technical
problems (e.g., network failure) in a uniform way.


6.6 Privacy Analysis 


Note that the protocol does not compute $U_k$. This is
intentional. $U_k$ contains information about user data: the
$i$-th row of $U_k$ encodes user $i$'s data in the
$k$-dimensional subspace and should not be revealed at all in a
privacy-respecting application. $V_k$, on the other hand,
encodes "item" data in the $k$-dimensional subspace (e.g., if
$A$ is a user-by-movie rating matrix, the items will be
movies). In most applications the desired information can be
computed from the singular values ($\Sigma_k$) and the right
singular vectors ($V_k^T$) (e.g., [11]).

At each iteration, the protocol reveals the matrix-vector
product $A^T A v$ for some vector $v$. This is not a problem
because the final results $\Sigma_k$ and $V_k^T$ already give
an approximation of $A^T A$
($A_k^T A_k = V_k \Sigma_k^2 V_k^T$). A simulator with the
final results can approximate the intermediate sums. Therefore
the intermediate aggregates do not reveal more information.


7 Implementation and Evaluation 


The P4P framework, including the SVD protocol, has been
implemented in Java using JNI and a NativeBigInteger
implementation from I2P (http://www.i2p2.de/). We ran several
experiments. The server is a 2.50GHz Xeon E5420 with 32GB of
memory; the clients are 2.00GHz Xeon E5405 machines with 800 MB
of memory allocated to the tests. In all the experiments,
$\phi$ is set to be a 62-bit integer and $q$ a 1024-bit one.

We evaluated our implementation on three data sets: the Enron
Email Data set [14], EachMovie (EM), and a randomly generated
dense matrix (RAND). The Enron corpus contains email data from
150 users, spanning a period of about 5 years (Jan. 1998 to
Dec. 2002). Our test was run on the social graph defined by the
email communications. The graph is represented as a
$150 \times 150$ matrix $A$ with $A(i, j)$ being the number of
emails sent by user $i$ to user $j$. EachMovie is a well-known
test data set for collaborative filtering. It comprises ratings
of 1648 movies by 74424 users. Each rating is a number in the
range [0, 1]. Both the Enron and EachMovie data sets are very
sparse, with densities 0.0736 and 0.0229, respectively. To test
the performance of our protocol on dense matrices, we randomly
generated a $2000 \times 2000$ matrix with entries chosen in
the range $[-2^{20}, 2^{20}]$.


7.1 Precision and Round Complexity 


We measured two quantities: $N$, the number of IRAM iterations
until ARPACK indicates convergence, and $\epsilon$, the
relative error. $N$ is the number of matrix-vector computations
required for ARPACK to converge. It is also the number of times
P4P aggregation is invoked. The error $\epsilon$ measures the
maximum relative residual norm among all eigenpairs computed:


|| A? Av; mez Avi ||2 
Ee=>= maxX ———— 
(hy aaa Ale 


Table 2 summarizes the results. In all these tests, 
we used machine precision as the tolerance input to 
ARPACK. The accuracy we obtained is very good: € re- 
mains very small for all tests (10~!* to 1078). In terms 
of round complexity, $N$ ranges from under 100 to a few
hundred. For comparison, we also measured the number of
iterations required by ARPACK when we perform the matrix-vector
multiplication directly without the P4P aggregation. In all
experiments, we found no difference in $N$ between this direct
method and our private implementation.


7.2 Performance 


We measured both the running time and the communication cost of
our scheme. We focused on server load, since each user only
needs to handle her own data and so is not a bottleneck. We
first present the case with $\kappa = 2$ servers. We measured
the work on the server hosting the ARPACK engine since it
carries more load.

First, the implementation confirmed our observations about the
difference in costs for manipulating large and small integers.
With a 1024-bit key length, one exponentiation within the
multiplicative group $\mathbb{Z}_q^*$ takes 5.86 milliseconds.
Addition and multiplication of two numbers, also within the
group, take 0.024 and 0.062 milliseconds, respectively. In
contrast, adding two 64-bit integers, which is the basic
operation the P4P framework performs, needs only
$2.7 \times 10^{-6}$ milliseconds. The product ZKP takes 35.7
ms of verifier time and 24.3 ms of prover time. The equivalence
ZKP takes no time since it simply reveals the difference of
the two random numbers used in the commitments [45]. For each
consistency check, the user needs to compute 9 commitments, 3
product ZKPs, 1 equivalence ZKP and 4 large integer
multiplications. The total cost is 178.63 milliseconds for each
user. For every user, each server needs to spend 212.83
milliseconds on verification.
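
The two per-user totals can be reproduced from the individual benchmarks under one assumption that the text does not state: that each commitment costs two exponentiations (as in a Pedersen commitment $g^a h^r$) and that the server recomputes the same nine commitments. The constant names below are ours; the breakdown is a plausible reading, not a confirmed description of the implementation.

```python
EXP_MS = 5.86                    # one exponentiation in Z_q^*
MUL_MS = 0.062                   # one multiplication in Z_q^*
PRODUCT_ZKP_PROVER_MS = 24.3
PRODUCT_ZKP_VERIFIER_MS = 35.7
EXPS_PER_COMMITMENT = 2          # assumption: Pedersen-style g^a * h^r

NUM_COMMITMENTS = 9
NUM_PRODUCT_ZKPS = 3
NUM_MULTIPLICATIONS = 4

def user_cost_ms():
    """Prover-side cost of one consistency check."""
    return (NUM_COMMITMENTS * EXPS_PER_COMMITMENT * EXP_MS
            + NUM_PRODUCT_ZKPS * PRODUCT_ZKP_PROVER_MS
            + NUM_MULTIPLICATIONS * MUL_MS)

def server_cost_ms():
    """Verifier-side cost of one consistency check for one user."""
    return (NUM_COMMITMENTS * EXPS_PER_COMMITMENT * EXP_MS
            + NUM_PRODUCT_ZKPS * PRODUCT_ZKP_VERIFIER_MS
            + NUM_MULTIPLICATIONS * MUL_MS)
```

Under this assumption the sums round to 178.63 ms and 212.83 ms, matching the reported figures.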

For our test data sets, it takes 74.73 seconds of server time
to validate and aggregate all 150 Enron users' data on a single
machine (each user needs to spend 726 milliseconds to prepare
the zero-knowledge proofs). This translates into a total of
5000 seconds, or 83 minutes, spent on private P4P aggregation
to compute k = 10 singular pairs. To compute the same number
of singular pairs for EachMovie, aggregating all users' data
takes about 6 hours (again on a single machine) and the total
time for 70 rounds is 420 hours. Note that the total includes
both verification and computation, so it is the cost of a
complete run. The server load appears large but is actually
very inexpensive. The aggregation process is trivially
parallelizable, and using a cluster of, say, 200 nodes will
reduce the running time to about 2 hours. This amounts to a
very insignificant cost for most service providers: using
Amazon EC2's price as a benchmark, it costs $0.80 per hour for
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute
Units each). The data transfer price is $0.100 per GB. The
total cost for computing SVD for a system with 74424 users is
merely about $15, including data transfer and adjusted for the
difference in CPU performance between our experiments and EC2.

Table 2: Round Complexity and Precision
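
The running-time figures for EachMovie follow from simple arithmetic, sketched below. The function names are ours, and the cluster estimate assumes ideal linear speedup with no coordination overhead, which is a simplification.

```python
def total_hours(hours_per_round, rounds):
    """Single-machine time for a complete run."""
    return hours_per_round * rounds

def cluster_hours(single_machine_hours, nodes):
    """Running time on a cluster, assuming ideal linear speedup (assumption)."""
    return single_machine_hours / nodes

def seconds_to_minutes(seconds):
    """Convert seconds to minutes."""
    return seconds / 60.0
```

Six hours per round over 70 rounds gives 420 hours, or 2.1 hours on 200 nodes; 5000 seconds is a little over 83 minutes.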

To compare with alternative solutions, we implemented a method
based on homomorphic encryption, which is a popular private
data mining technique (see e.g., [11, 51]). We did not try
other methods, such as the "add/subtract random" approach, with
players adding their values to a running total, because they do
not allow for verification of user data and are thus insecure
in our model. We tested both ElGamal and Paillier encryption
with the same security parameter as our P4P experiments (i.e.,
1024-bit key). With the homomorphic encryption approach, it is
almost impossible to execute the ZK verification (although
there is a protocol [11]) as it takes hours to verify one user.
So we only compared the time needed for computing the
aggregates. Figure 2 shows the ratios of running time between
homomorphic encryption and P4P for SVD on the three data sets.
P4P is at least 8 orders of magnitude faster in all cases for
both ElGamal and Paillier. This translates to tens of millions
of dollars of cost for the homomorphic encryption schemes if
the computation is done using Amazon's EC2 service, not even
counting data transfer expenses.

The communication overhead is also very small since 
the protocol passes very few large integers. The extra 
communication per client for one L2-norm ZKP is un- 
der 50 kilobytes, and under 100 bytes for the consistency 
check, while other solutions require some hundreds of 
megabytes. This is significantly smaller than the size of 
an average web page. The additional workload for the 
server is less than serving an extra page to each user. 


The case with $\kappa > 2$ servers: Although we do not expect
the scheme to be deployed with a large number of servers, we
provide some analysis here in case stronger protection is
required. Each server's work can be divided into two parts:
processing clients and communicating with other servers. The
most expensive interactions are with the clients (including
verifying the ZKPs etc.), which can be performed on a single
server and are independent of $\kappa$. The interaction among
servers is simply data exchange and there is no complex
computation involved.

Data exchange among the servers serves two purposes:
reconstructing shared secrets when necessary (the final sum at
the end of each iteration and the commitments during the
verification) and reaching agreement regarding a user's status
(each server needs to verify that the user computes a share of
the commitments correctly). Since each server is semi-honest,
for the second part they only need to pass the final
conclusion; verification of the ZKPs can be done on only one of
the servers.

Figure 2: Running time ratios between homomorphic encryption
based solutions and P4P.


For constructing the final sum, all servers must send
their shares to the server hosting ARPACK. The latter
will receive a total of $8\kappa m$ bytes (assuming data is
encoded in double precision), which is about $8\kappa$ MB if
$m = 10^6$. For the consistency check, during each
iteration, one server is selected as the “master”. All other
servers send their shares of the commitments to the master.
This includes $3n$ large integers in $\mathbb{Z}_q$ (3 for each
user) from each server. In addition, each non-master
server also sends to the master an $n$-bit bitmap, encoding
whether each user computes the commitments to the
shares correctly. The master will reconstruct the complete
commitments and verify the ZKPs. It then broadcasts
an $n$-bit bitmap encoding whether each user passes
the consistency check to all other servers. For the master,
the total communication cost is receiving $3n(\kappa - 1)$
integers in $\mathbb{Z}_q$ and $\kappa$ $n$-bit strings and sending
$(\kappa - 1)n$ bits. With $n = 10^6$ and $|q| = 1024$, these amount to
$384(\kappa - 1)$ MB and approximately $0.1(\kappa - 1)$ MB,
respectively. For other servers, the sending and receiving
costs are approximately 384 MB and 0.1 MB, respectively.
We believe such cost is practical for small $\kappa$ (e.g.,
3 or 4). Note that the master does not have to be collocated
with the ARPACK engine, so the servers can take
turns to serve as the master to share the load.
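
The figures above can be re-derived with a few lines of arithmetic. The following is a minimal sketch, assuming decimal megabytes (10^6 bytes) and using hypothetical function names that do not come from the P4P code base; it only restates the formulas in the text.

```python
# Back-of-the-envelope communication costs for the kappa-server case.
# Assumption: 1 MB = 10**6 bytes, which is what the quoted figures imply.
MB = 10**6


def final_sum_bytes(kappa, m):
    """Bytes received by the ARPACK host: kappa share vectors of m doubles."""
    return 8 * kappa * m


def master_receive_bytes(kappa, n, q_bits):
    """Commitment shares: 3 integers in Z_q per user from each non-master."""
    return 3 * n * (kappa - 1) * q_bits // 8


def master_broadcast_bytes(kappa, n):
    """n-bit pass/fail bitmap sent to each of the kappa - 1 other servers."""
    return (kappa - 1) * n / 8


if __name__ == "__main__":
    n = m = 10**6
    for kappa in (3, 4):
        print(
            kappa,
            final_sum_bytes(kappa, m) / MB,
            master_receive_bytes(kappa, n, 1024) / MB,
            master_broadcast_bytes(kappa, n) / MB,
        )
```

With $n = 10^6$ and 1024-bit integers, each non-master server contributes 384 MB of commitment shares, and each bitmap is 0.125 MB, which the text rounds to 0.1 MB.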


As for the computation associated with using $\kappa$ servers
(the part that is independent of $\kappa$ has been discussed
earlier and is omitted here), the master needs to perform
$3n(\kappa - 1)$ multiplications in $\mathbb{Z}_q$. Using our benchmark,
this amounts to $0.186(\kappa - 1)$ seconds for $n = 10^6$ users.
Again we believe this is practical for small $\kappa$. The other
servers do not need to do any extra work.


7.3. Scalability 


We also experimented with a few very large matrices, 
with dimensionality ranging from tens of thousands to 
over a hundred million. They are document-term or user- 
query matrices that are used for latent semantic analysis. 
To facilitate the tests, we did not include the data ver- 
ification ZKPs, as our previous benchmarks show they 
amount to an insignificant fraction of the cost. Due to 
space and resource limits, we did not test how performance
varies with dimensionality and other parameters. Rather, 
these results are meant to demonstrate the capability of 
our system, which we have shown to maintain privacy at 
very low cost, to handle large data sets at various config- 
urations. 


Table 3 summarizes some of the results. The running 
time measures the time of a complete run, i.e., from the 
start of the job till the results are safely written to disk. 
It includes both the computation time of the server (in- 
cluding the time spent on invoking the ARPACK engine) 
and the clients (which are running in parallel), and the 
communication time. In the table, frontend processors 
refer to the machines that interact with the users directly. 
Large-scale systems usually use multiple frontend ma- 
chines, each serving a subset of the users. This is also a 
straightforward way to parallelize the aggregation process,
i.e., each frontend machine receives data from a
subset of users and aggregates them before forwarding 
to the server. On one hand, the more frontend machines 
the faster the sub-aggregates can be computed. On the 
other hand, the server’s communication cost is linear in 
the number of frontend processors. The optimal solution 
must strike a balance between the two. Due to resource 
limitations, we were not able to use the optimal configuration
for all our tests. The results are feasible even in
these sub-optimal cases. 
8 Conclusion


In this paper we present a new framework for privacy- 
preserving distributed data mining. Our protocol is based 
on secret sharing over a small field, achieving orders of
magnitude reduction in running time over alternative so- 
lutions with large-scale data. The framework also admits 
very efficient zero-knowledge tools that can be used to 
verify user data. They provide practical solutions for 
handling cheating users. P4P demonstrates that cryp- 
tographic building blocks can work harmoniously with 
existing tools, providing privacy without degrading their 
efficiency. Most components described in this paper 
have been implemented and the source code is avail- 
able at http://bid.berkeley.edu/projects/p4p/. Our goal is 
to make it a useful tool for developers in data mining 
and others to build privacy preserving real-world appli- 
cations. 
Notes


¹Most mining algorithms need to bound the amount of noise in the
data to produce meaningful results. This means that the fraction of 
cheating users is usually below a much lower threshold (e.g. a < 
Abstract


Secure multiparty computation (MPC) allows joint 
privacy-preserving computations on data of multiple par- 
ties. Although MPC has been studied substantially, 
building solutions that are practical in terms of compu- 
tation and communication cost is still a major challenge. 
In this paper, we investigate the practical usefulness of 
MPC for multi-domain network security and monitor- 
ing. We first optimize MPC comparison operations for 
processing high volume data in near real-time. We then 
design privacy-preserving protocols for event correlation 
and aggregation of network traffic statistics, such as ad- 
dition of volume metrics, computation of feature entropy, 
and distinct item count. Optimizing performance of par- 
allel invocations, we implement our protocols along with 
a complete set of basic operations in a library called 
SEPIA. We evaluate the running time and bandwidth re- 
quirements of our protocols in realistic settings on a lo- 
cal cluster as well as on PlanetLab and show that they 
work in near real-time for up to 140 input providers and 
9 computation nodes. Compared to implementations us- 
ing existing general-purpose MPC frameworks, our pro- 
tocols are significantly faster, requiring, for example, 3 
minutes for a task that takes 2 days with general-purpose 
frameworks. This improvement paves the way for new 
applications of MPC in the area of networking. Finally, 
we run SEPIA’s protocols on real traffic traces of 17 net- 
works and show how they provide new possibilities for 
distributed troubleshooting and early anomaly detection. 


1 Introduction 


A number of network security and monitoring prob- 
lems can substantially benefit if a group of involved or- 
ganizations aggregates private data to jointly perform a 
computation. For example, IDS alert correlation, e.g., 
with DOMINO [49], requires the joint analysis of private
alerts. Similarly, aggregation of private data is useful
for alert signature extraction [30], collaborative anomaly
detection [34], multi-domain traffic engineering [27],
detecting traffic discrimination [45], and collecting network
performance statistics [42]. All these approaches
use either a trusted third party, e.g., a university research 
group, or peer-to-peer techniques for data aggregation 
and face a delicate privacy versus utility tradeoff [32]. 
Some private data typically have to be revealed, which 
impedes privacy and prohibits the acquisition of many 
data providers, while data anonymization, used to re- 
move sensitive information, complicates or even pro- 
hibits developing good solutions. Moreover, the ability 
of anonymization techniques to effectively protect pri- 
vacy is questioned by recent studies [29]. One possible 
solution to this privacy-utility tradeoff is MPC. 


For almost thirty years, MPC [48] techniques have 
been studied for solving the problem of jointly running 
computations on data distributed among multiple orga- 
nizations, while provably preserving data privacy with- 
out relying on a trusted third party. In theory, any com- 
putable function on a distributed dataset is also securely 
computable using MPC techniques [20]. However, de- 
signing solutions that are practical in terms of running 
time and communication overhead is non-trivial. For this 
reason, MPC techniques have mainly attracted theoretical
interest in the last decades. Recently, optimized basic
primitives, such as comparisons [14, 28], have been
progressively making MPC usable in real-world applications,
e.g., an actual sugar-beet auction [7] was demonstrated
in 2009.


Adopting MPC techniques to network monitoring and 
security problems introduces the important challenge of 
dealing with voluminous input data that require online 
processing. For example, anomaly detection techniques 
typically require the online generation of traffic volume 
and distributions over port numbers or IP address ranges. 
Such input data impose stricter requirements on the per- 
formance of MPC protocols than, for example, the in- 
put bids of a distributed MPC auction [7]. In particular, 
network monitoring protocols should process potentially 
Figure 1: Deployment scenario for SEPIA.


thousands of input values while meeting near real-time 
guarantees¹. This is not presently possible with existing
general-purpose MPC frameworks. 

In this work, we design, implement, and evaluate 
SEPIA (Security through Private Information Aggrega- 
tion), a library for efficiently aggregating multi-domain 
network data using MPC. The foundation of SEPIA is 
a set of optimized MPC operations, implemented with 
performance of parallel execution in mind. By not en- 
forcing protocols to run in a constant number of rounds, 
we are able to design MPC comparison operations that 
require up to 80 times fewer distributed multiplications
and, amortized over many parallel invocations, run much 
faster than constant-round alternatives. On top of these 
comparison operations, we design and implement novel 
MPC protocols tailored for network security and moni- 
toring applications. The event correlation protocol iden- 
tifies events, such as IDS or firewall alerts, that occur 
frequently in multiple domains. The protocol is generic
and has several applications, for example, in alert correlation
for early exploit detection or in identification of
multi-domain network traffic heavy-hitters. In addition, 
we introduce SEPIA’s entropy and distinct count proto- 
cols that compute the entropy of traffic feature distribu- 
tions and find the count of distinct feature values, respec- 
tively. These metrics are used frequently in traffic anal- 
ysis applications. In particular, the entropy of feature 
distributions is used commonly in anomaly detection, 
whereas distinct count metrics are important for identify- 
ing scanning attacks, in firewalls, and for anomaly detec- 
tion. We implement these protocols along with a vector 
addition protocol to support additive operations on time- 
series and histograms. 

A typical setup for SEPIA is depicted in Fig. 1 where 
individual networks are represented by one input peer 
each. The input peers distribute shares of secret input 
data among a (usually smaller) set of privacy peers us- 
ing Shamir’s secret sharing scheme [40]. The privacy 
peers perform the actual computation and can be hosted
by a subset of the networks running input peers but also 
by external parties. Finally, the aggregate computation 
result is sent back to the networks. We adopt the semi- 
honest adversary model, hence privacy of local input data 
is guaranteed as long as the majority of privacy peers is 
honest. A detailed description of our security assump- 
tions and a discussion of their implications is presented 
in Section 4. 


Our evaluation of SEPIA’s performance shows that 
SEPIA runs in near real-time even with 140 input and 
9 privacy peers. Moreover, we run SEPIA on traffic data 
of 17 networks collected during the global Skype out- 
age in August 2007 and show how the networks can use 
SEPIA to troubleshoot and detect such anomalies in a timely manner.
Finally, we discuss novel applications in network secu- 
rity and monitoring that SEPIA enables. In summary, 
this paper makes the following contributions: 


1. We introduce efficient MPC comparison operations, 
which outperform constant-round alternatives for 
many parallel invocations. 

2. We design novel MPC protocols for event correla- 
tion, entropy and distinct count computation. 

3. We introduce the SEPIA library, in which we im- 
plement our protocols along with a complete set of 
basic operations, optimized for parallel execution. 
SEPIA is made publicly available [39]. 

4. We extensively evaluate the performance of SEPIA 
on realistic settings using synthetic and real traces 
and show that it meets near real-time guarantees 
even with 140 input and 9 privacy peers. 

5. We run SEPIA on traffic from 17 networks and 
show how it can be used to troubleshoot and detect
anomalies in a timely manner, exemplified by the Skype outage.


The paper is organized as follows: We specify the 
computation scheme in the next section and present our 
optimized comparison operations in Section 3. In Section 4,
we specify our adversary model and security assumptions,
and build the protocols for event correlation,
vector addition, entropy, and distinct count computation. 
We evaluate the protocols and discuss SEPIA’s design in 
Sections 5 and 6, respectively. Then, in Section 7 we 
outline SEPIA’s applications and conduct a case study 
on real network data that demonstrates SEPIA’s benefits 
in distributed troubleshooting and early anomaly detec- 
tion. Finally, we discuss related work in Section 8 and 
conclude our paper in Section 9. 


2 Preliminaries 


Our implementation is based on Shamir secret sharing [40].
In order to share a secret value $s$ among a set of
$m$ players, the dealer generates a random polynomial $f$
of degree $t = \lfloor (m-1)/2 \rfloor$ over a prime field $\mathbb{Z}_p$ with
$p > s$, such that $f(0) = s$. Each player $i = 1 \ldots m$ then
receives an evaluation point $s_i = f(i)$ of $f$; $s_i$ is called
the share of player $i$. The secret $s$ can be reconstructed
from any $t + 1$ shares using Lagrange interpolation but
is completely undefined for $t$ or fewer shares. To actually
reconstruct a secret, each player sends his shares to all
other players. Each player then locally interpolates the
secret. For simplicity of presentation, we use $[s]$ to denote
the vector of shares $(s_1, \ldots, s_m)$ and call it a sharing
of $s$. In addition, we use $[s]_i$ to refer to $s_i$. Unless
stated otherwise, we choose $p$ with 62 bits such that arithmetic
operations on secrets and shares can be performed
by CPU instructions directly, not requiring software
algorithms to handle big integers.
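
The sharing and reconstruction steps can be sketched as follows. This is an illustrative sketch, not SEPIA's implementation: the function names are hypothetical, and the field prime $2^{61} - 1$ is chosen only because it is a well-known Mersenne prime (the text uses a 62-bit prime).

```python
import secrets

# Assumption: example field prime (the Mersenne prime 2**61 - 1).
P = 2**61 - 1


def share(s, m, p=P):
    """Split s into m shares (i, f(i)) with a random f of degree t = (m-1)//2."""
    t = (m - 1) // 2
    # f(x) = s + c_1 x + ... + c_t x^t, so that f(0) = s.
    coeffs = [s % p] + [secrets.randbelow(p) for _ in range(t)]
    shares = []
    for i in range(1, m + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of f(i)
            y = (y * i + c) % p
        shares.append((i, y))
    return shares


def reconstruct(points, p=P):
    """Lagrange interpolation at x = 0 from a list of (i, s_i) pairs."""
    secret = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (-xj) % p
                den = den * (xi - xj) % p
        # pow(den, -1, p) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, p)) % p
    return secret


if __name__ == "__main__":
    shares = share(123456789, m=5)   # t = 2, so any 3 shares suffice
    print(reconstruct(shares[:3]))
```

With $m = 5$ players the polynomial has degree $t = 2$, so any three shares recover the secret while any two leave it undetermined.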


Addition and Multiplication Given two sharings $[a]$
and $[b]$, we can perform private addition and multiplication
of the two values $a$ and $b$. Because Shamir’s scheme
is linear, addition of two sharings, denoted by $[a] + [b]$,
can be computed by having each player locally add his
shares of the two values: $[a + b]_i = [a]_i + [b]_i$. Similarly,
local shares are subtracted to get a share of the
difference. To add a public constant $c$ to a sharing $[a]$,
denoted by $[a] + c$, each player just adds $c$ to his share,
i.e., $[a + c]_i = [a]_i + c$. Similarly, for multiplying $[a]$ by a
public constant $c$, denoted by $c[a]$, each player multiplies
his share by $c$. Multiplication of two sharings requires an
extra round of communication to guarantee randomness
and to correct the degree of the new polynomial [4, 19].
In particular, to compute $[a][b] = [ab]$, each player first
computes $d_i = [a]_i [b]_i$ locally. He then shares $d_i$ to get
$[d_i]$. Together, the players then perform a distributed Lagrange
interpolation to compute $[ab] = \sum_i \lambda_i [d_i]$, where
$\lambda_i$ are the Lagrange coefficients. Thus, a distributed
multiplication requires a synchronization round with $m^2$
messages, as each player $i$ sends to each player $j$ the
share $[d_i]_j$. To specify protocols, composed of basic operations,
we use a shorthand notation. For instance, we
write $\mathrm{foo}([a], b) := ([a] + b)([a] + b)$, where foo is the
protocol name, followed by input parameters. Valid input
parameters are sharings and public constants. On the
right side, the function to be computed is given, a binomial
in that case. The output of foo is again a sharing
and can be used in subsequent computations. All operations
in $\mathbb{Z}_p$ are performed modulo $p$; therefore $p$ must be
large enough to avoid modular reductions of intermediate
results, e.g., if we compute $[ab] = [a][b]$, then $a$, $b$,
and $ab$ must be smaller than $p$.
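
The local addition and the resharing-based multiplication described above can be simulated in a single process. The sketch below is an assumption-laden illustration (hypothetical names, example prime $2^{61} - 1$, all players simulated in one list), not the SEPIA library code.

```python
import secrets

P = 2**61 - 1  # assumption: example prime field


def share(s, m, p=P):
    """Shares f(1..m) of a random degree-t polynomial with f(0) = s."""
    t = (m - 1) // 2
    coeffs = [s % p] + [secrets.randbelow(p) for _ in range(t)]
    return [sum(c * pow(i, k, p) for k, c in enumerate(coeffs)) % p
            for i in range(1, m + 1)]


def lagrange_at_zero(m, p=P):
    """Coefficients lambda_i with f(0) = sum_i lambda_i * f(i), deg f < m."""
    lams = []
    for i in range(1, m + 1):
        num, den = 1, 1
        for j in range(1, m + 1):
            if j != i:
                num = num * (-j) % p
                den = den * (i - j) % p
        lams.append(num * pow(den, -1, p) % p)
    return lams


def reconstruct(shares, p=P):
    lams = lagrange_at_zero(len(shares), p)
    return sum(l * s for l, s in zip(lams, shares)) % p


def add(a_sh, b_sh, p=P):
    """[a + b]_i = [a]_i + [b]_i, purely local."""
    return [(x + y) % p for x, y in zip(a_sh, b_sh)]


def add_const(a_sh, c, p=P):
    """[a + c]_i = [a]_i + c, purely local."""
    return [(x + c) % p for x in a_sh]


def mul(a_sh, b_sh, p=P):
    m = len(a_sh)
    # Step 1: local products d_i lie on a polynomial of degree 2t <= m - 1.
    d = [(x * y) % p for x, y in zip(a_sh, b_sh)]
    # Step 2: player i reshares d_i; resh[i][j] is the message from i to j,
    # i.e. m * m messages in one synchronization round.
    resh = [share(d_i, m, p) for d_i in d]
    # Step 3: player j computes [ab]_j = sum_i lambda_i * [d_i]_j locally.
    lams = lagrange_at_zero(m, p)
    return [sum(lams[i] * resh[i][j] for i in range(m)) % p for j in range(m)]


def foo(a_sh, b, p=P):
    """foo([a], b) := ([a] + b)([a] + b), the shorthand example from the text."""
    s = add_const(a_sh, b, p)
    return mul(s, s, p)
```

Because the result of `mul` is again a degree-$t$ sharing, it can be fed into further multiplications, which is what makes protocol composition possible.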


Communication A set of independent multiplications, 
e.g., [ab] and [cd], can be performed in parallel in a sin- 
gle round. That is, intermediate results of all multipli- 
cations are exchanged in a single synchronization step. 
A round simply is a synchronization point where players 
have to exchange intermediate results in order to con- 
tinue computation. While the specification of the proto- 
cols is synchronous, we do not assume the network to 
be synchronous during runtime. In particular, the Inter- 
net is better modeled as asynchronous, not guaranteeing 
the delivery of a message before a certain time. Be- 
cause we assume the semi-honest model, we only have 
to protect against high delays of individual messages, 
potentially leading to a reordering of message arrival. 
In practice, we implement communication channels us- 
ing SSL sockets over TCP/IP. TCP applies acknowledg- 
ments, timeouts, and sequence numbers to preserve mes- 
sage ordering and to retransmit lost messages, providing 
FIFO channel semantics. We implement message syn- 
chronization in parallel threads to minimize waiting time. 
Each player proceeds to the next round immediately after 
sending and receiving all intermediate values. 


Security Properties All the protocols we devise are 
compositions of the above introduced addition and mul- 
tiplication primitives, which were proven correct and 
information-theoretically secure by Ben-Or, Goldwasser, 
and Wigderson [4]. In particular, they showed that in the 
semi-honest model, where adversarial players follow the 
protocol but try to learn as much as possible by sharing 
the information they received, no set of $t$ or fewer corrupt
players gets any additional information other than the fi- 
nal function value. Also, these primitives are universally 
composable, that is, the security properties remain in- 
tact under stand-alone and concurrent composition [11]. 
Because the scheme is information-theoretically secure, 
i.e., it is secure against computationally unbounded adversaries,
the confidentiality of secrets does not depend
on the field size p. For instance, regarding confidential- 
ity, sharing a secret s in a field of size p > s is equivalent 
to sharing each individual bit of s in a field of size p = 2. 
Because we use SSL for implementing secure channels, 
the overall system relies on PKI and is only computationally
secure.
3 Optimized Comparison Operations


Unlike addition and multiplication, comparison of two 
shared secrets is a very expensive operation. There- 
fore, we now devise optimized protocols for equality 
check, less-than comparison and a short range check. 
The complexity of an MPC protocol is typically assessed
by counting the number of distributed multiplications and
rounds, because addition and multiplication with public
values only require local computation. Damgård
et al. introduced the bit-decomposition protocol [14]
that achieves comparison by decomposing shared secrets
into a shared bit-wise representation. On shares
of individual bits, comparison is straightforward. With
$l = \log_2(p)$, the protocols in [14] achieve a comparison
with $205l + 188l \log_2 l$ multiplications in 44 rounds and
equality test with $98l + 94l \log_2 l$ multiplications in 39
rounds. Subsequently, Nishide and Ohta [28] have improved
these protocols by not decomposing the secrets
but using bitwise shared random numbers. They do comparison
with $279l + 5$ multiplications in 15 rounds and
equality test with $81l$ multiplications in 8 rounds. While
these are constant-round protocols as preferred in theoretical
research, they still involve lots of multiplications.
For instance, an equality check of two shared IPv4 addresses
($l = 32$) with the protocols in [28] requires 2592
distributed multiplications, each triggering $m^2$ messages
to be transmitted over the network.


Constant-round vs. number of multiplications Our 
key observation for improving efficiency is the follow- 
ing: For scenarios with many parallel protocol invoca- 
tions it is possible to build much more practical protocols 
by not enforcing the constant-round property. Constant- 
round means that the number of rounds does not depend 
on the input parameters. We design protocols that run
in $O(l)$ rounds and are therefore not constant-round,
although, once the field size $p$ is defined, the number of
rounds is also fixed, i.e., not varying at runtime. The
overall local running time of a protocol is determined by
i) the local CPU time spent on computations, ii) the time
to transfer intermediate values over the network, and iii)
delay experienced during synchronization. Designing
constant-round protocols aims at reducing the impact of
iii) by keeping the number of rounds fixed and usually
small. To achieve this, high multiplicative constants for
the number of multiplications are often accepted (e.g.,
$279l$). Yet, both i) and ii) directly depend on the number
of multiplications. For applications with few parallel
operations, protocols with few rounds (usually constant-round)
are certainly faster. However, with many parallel
operations, as required by our scenarios, the impact
of network delay is amortized and the number of multiplications
(the actual workload) becomes the dominating
factor. Our evaluation results in Sections 5.1 and 5.4 confirm
this and show that CPU time and network bandwidth
are the main constraining factors, calling for a reduction
of multiplications.


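Equality Test In the field $\mathbb{Z}_p$ with $p$ prime, Fermat’s little
theorem states


$$c^{p-1} = \begin{cases} 0 & \text{if } c = 0 \\ 1 & \text{if } c \neq 0 \end{cases} \qquad (1)$$


Using (1) we define a protocol for equality test as follows:


$$\mathrm{equal}([a], [b]) := 1 - ([a] - [b])^{p-1}$$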


The output of equal is $[1]$ in case of equality and $[0]$ otherwise
and can hence be used in subsequent computations.
Using square-and-multiply for the exponentiation,
we implement equal with $l + k - 2$ multiplications in $l$
rounds, where $k$ denotes the number of bits set to 1 in
$p - 1$. By using carefully picked prime numbers with
$k \le 3$, we reduce the number of multiplications to $l + 1$.
In the above example for comparing IPv4 addresses, this
reduces the multiplication count by a factor of 76 from
2592 to 34.

Besides having few 1-bits, $p$ must be bigger than the
range of shared secrets, i.e., if 32-bit integers are shared,
an appropriate $p$ will have at least 33 bits. For any secret
size below 64 bits it is easy to find appropriate primes $p$ with
$k \le 3$ within 3 additional bits.
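
The multiplication count can be checked on cleartext values. The following sketch is an illustration under stated assumptions: it operates on plain integers in $\mathbb{Z}_p$ rather than on sharings (in the actual protocol every counted multiplication would be a distributed one), and the prime search routine and all function names are hypothetical.

```python
def is_prime(n):
    """Deterministic trial division; adequate for the ~35-bit values here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def find_sparse_prime(secret_bits, extra_bits=3):
    """Find a prime p > 2**secret_bits - 1 such that p - 1 has <= 3 one-bits."""
    for top in range(secret_bits, secret_bits + extra_bits):
        # p - 1 must be even, so bit 0 is never set in the candidates.
        cands = [1 << top]
        cands += [(1 << top) | (1 << b) for b in range(1, top)]
        cands += [(1 << top) | (1 << b) | (1 << c)
                  for b in range(2, top) for c in range(1, b)]
        for e in sorted(cands):
            if is_prime(e + 1):
                return e + 1
    raise ValueError("no suitable prime found in the searched range")


def pow_count(c, e, p):
    """Left-to-right square-and-multiply; returns (c**e mod p, #multiplications)."""
    result, mults = c % p, 0
    for bit in bin(e)[3:]:          # skip '0b' and the leading 1-bit
        result = result * result % p
        mults += 1
        if bit == "1":
            result = result * c % p
            mults += 1
    return result, mults


def equal(a, b, p):
    """1 - (a - b)^(p-1) mod p, together with the multiplication count."""
    z, mults = pow_count((a - b) % p, p - 1, p)
    return (1 - z) % p, mults


if __name__ == "__main__":
    p = find_sparse_prime(32)       # field for 32-bit secrets such as IPv4
    l, k = p.bit_length(), bin(p - 1).count("1")
    print(p, l, k, equal(0xC0A80001, 0xC0A80001, p))
```

For the Fermat prime $p = 65537$ ($l = 17$, $k = 1$) the exponentiation consists of 16 squarings only, matching $l + k - 2$.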


Less Than For less-than comparison, we base our im- 
plementation on Nishide’s protocol [28]. However, we 
apply modifications to again reduce the overall number 
of required multiplications by more than a factor of 10. 
Nishide’s protocol is quite comprehensive and built on a 
stack of subprotocols for least-significant bit extraction 
(LSB), operations on bitwise-shared secrets, and (bitwise)
random number sharing. The protocol uses the observation
that $a < b$ is determined by the three predicates
$a < p/2$, $b < p/2$, and $a - b < p/2$. Each predicate is
computed by a call of the LSB protocol for $2a$, $2b$, and
$2(a - b)$. If $a < p/2$, no wrap-around modulo $p$ occurs
when computing $2a$, hence $\mathrm{LSB}(2a) = 0$. However, if
$a > p/2$, a wrap-around will occur and $\mathrm{LSB}(2a) = 1$.
Knowing one of the predicates in advance, e.g., because
$b$ is not secret but publicly known, saves one of the three
LSB calls and hence 1/3 of the multiplications.
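
The wrap-around observation can be checked on cleartext values. The sketch below is an illustration only: the text does not spell out how the three predicates are combined, so the combining formula in `less_than` is an assumption derived here (it is verified by brute force for small primes), and the function names are hypothetical.

```python
def lsb_of_double(v, p):
    """LSB(2v mod p) for odd p: 0 if v < p/2 (no wrap-around), else 1."""
    return (2 * v % p) & 1


def less_than(a, b, p):
    """Decide a < b in Z_p from the three half-range predicates."""
    w = 1 - lsb_of_double(a, p)             # [a < p/2]
    x = 1 - lsb_of_double(b, p)             # [b < p/2]
    y = 1 - lsb_of_double((a - b) % p, p)   # [(a - b) mod p < p/2]
    # Assumed combination: if a and b lie in different halves, the halves
    # decide; if they lie in the same half, a < b exactly when a - b wraps.
    same_half = w * x + (1 - w) * (1 - x)
    return w * (1 - x) + same_half * (1 - y)


if __name__ == "__main__":
    p = 13
    ok = all(less_than(a, b, p) == int(a < b)
             for a in range(p) for b in range(p))
    print(ok)
```

When $b$ is public, the predicate $b < p/2$ is known without an LSB call, which is where the saving of one third comes from.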

Due to space restrictions we do not reproduce the
entire protocol but focus on the modifications we apply.
An important subprotocol in Nishide’s construction
is PrefixOr. Given a sequence of shared bits
$[a_1], \ldots, [a_l]$ with $a_i \in \{0, 1\}$, PrefixOr computes the
sequence $[b_1], \ldots, [b_l]$ such that $b_i = \bigvee_{j=1}^{i} a_j$. Nishide’s
PrefixOr requires only 7 rounds but $17l$ multiplications.
We implement PrefixOr based on the fact that
$b_i = b_{i-1} \vee a_i$ and $b_1 = a_1$. The logical OR ($\vee$) can
be computed using a single multiplication: $[x] \vee [y] =
[x] + [y] - [x][y]$. Thus, our PrefixOr requires $l - 1$
rounds and only $l - 1$ multiplications.
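
The sequential PrefixOr recurrence is short enough to write out. This is a minimal sketch on cleartext bits with a hypothetical function name; on sharings, each product below would be one distributed multiplication in its own round.

```python
def prefix_or(bits):
    """Return ([b_1..b_l], #multiplications) with b_i = a_1 OR ... OR a_i."""
    out, mults = [bits[0]], 0       # b_1 = a_1 needs no multiplication
    for a in bits[1:]:
        prev = out[-1]
        out.append(prev + a - prev * a)   # x OR y = x + y - x*y
        mults += 1
    return out, mults


if __name__ == "__main__":
    print(prefix_or([0, 0, 1, 0, 1, 0]))
```

For $l$ input bits the loop runs $l - 1$ times, trading Nishide's 7 rounds for a far smaller multiplication count.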

Without compromising security properties, we replace
the PrefixOr in Nishide’s protocol with our optimized
version and call the resulting comparison protocol
lessThan. A call of lessThan($[a]$, $[b]$) outputs $[1]$
if $a < b$ and $[0]$ otherwise. The overall complexity of
lessThan is $24l + 5$ multiplications in $2l + 10$ rounds as
compared to Nishide’s version with $279l + 5$ multiplications
in 15 rounds.


Short Range Check To further reduce multiplications
for comparing small numbers, we devise a check for
short ranges, based on our equal operation. Suppose
one wants to compute $[a] < T$, where $T$ is a small
public constant, e.g., $T = 10$. Instead of invoking
lessThan($[a]$, $T$) one can simply compute the polynomial
$[\phi] = [a]([a] - 1)([a] - 2) \cdots ([a] - (T - 1))$. If the
value of $a$ is between 0 and $T - 1$, exactly one term of
$[\phi]$ will be zero and hence $[\phi]$ will evaluate to $[0]$.
Otherwise, $[\phi]$ will be non-zero. Based on this, we define a
protocol for checking short public ranges that returns $[1]$
if $x \le [a] \le y$ and $[0]$ otherwise:


$$\mathrm{shortRange}([a], x, y) := \mathrm{equal}\Big(0, \prod_{i=x}^{y} ([a] - i)\Big)$$


The complexity of shortRange is $(y - x) + l + k - 2$
multiplications in $l + \log_2(y - x)$ rounds. Computing
lessThan($[a]$, $y$) requires $16l + 5$ multiplications (1/3 is
saved because $y$ is public). Hence, regarding the number
of multiplications, computing shortRange($[a]$, 0, $y - 1$)
instead of lessThan($[a]$, $y$) is beneficial roughly as long
as $y \le 15l$.
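
The range check and the cost comparison can be illustrated on cleartext values. The sketch below assumes the Fermat prime $65537$ as an example field and uses hypothetical function names; Python's built-in `pow` stands in for the square-and-multiply exponentiation.

```python
P = 65537  # assumption: example field, the Fermat prime 2**16 + 1


def equal_zero(c, p=P):
    """1 if c == 0 in Z_p, else 0, via Fermat's little theorem."""
    return (1 - pow(c, p - 1, p)) % p


def short_range(a, x, y, p=P):
    """1 if x <= a <= y, else 0; requires a and the range to be below p."""
    prod = 1
    for i in range(x, y + 1):
        prod = prod * (a - i) % p   # becomes zero iff a equals some i
    return equal_zero(prod, p)


def short_range_mults(x, y, l, k):
    """Multiplication count stated in the text: (y - x) + l + k - 2."""
    return (y - x) + l + k - 2


def less_than_public_mults(l):
    """Multiplication count of lessThan with a public operand: 16l + 5."""
    return 16 * l + 5


if __name__ == "__main__":
    print([short_range(a, 3, 7) for a in range(10)])
    print(short_range_mults(0, 9, 33, 3), less_than_public_mults(33))
```

For a 33-bit field with $k = 3$, checking $a < 10$ costs 43 multiplications with shortRange versus 533 with lessThan.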


4 SEPIA Protocols 


In this section, we compose the basic operations de- 
fined above into full-blown protocols for network event 
correlation and statistics aggregation. Each protocol is 
designed to run on continuous streams of input traffic 
data partitioned into time windows of a few minutes. For
the sake of simplicity, the protocols are specified for a single
time window. We first define the basic setting of SEPIA
protocols as illustrated in Fig. 1 and then introduce the
protocols successively.

Our system has a set of $n$ users called input peers. The
input peers want to jointly compute the value of a public
function $f(x_1, \ldots, x_n)$ on their private data $x_i$ without
disclosing anything about $x_i$. In addition, we have
$m$ players called privacy peers that perform the computation
of $f()$ by simulating a trusted third party (TTP).
Each entity can take both roles, acting as an input peer
only, as a privacy peer (PP) only, or as both.


Adversary Model and Security Assumptions We use 
the semi-honest (a.k.a. honest-but-curious) adversary 
model for privacy peers. That is, honest privacy peers 
follow the protocol and do not combine their informa- 
tion. Semi-honest privacy peers do follow the proto- 
col but try to infer as much as possible from the val- 
ues (shares) they learn, also by combining their informa- 
tion. The privacy and correctness guarantees provided 
by our protocols are determined by Shamir’s secret shar- 
ing scheme. In particular, the protocols are secure for 
t < m/2 semi-honest privacy peers, i.e., as long as the 
majority of privacy peers is honest. Even if some of the 
input peers do not trust each other, we think it is realistic 
to assume that they will agree on a set of most-trusted 
participants (or external entities) for hosting the privacy 
peers. Also, we think it is realistic to assume that the 
privacy peers indeed follow the protocol. If they are op- 
erated by input peers, they are likely interested in the 
correct outcome of the computation themselves and will 
therefore comply. External privacy peers are selected due 
to their good reputation or are paid for a service.
In both cases, they will do their best not to offend their 
customers by tricking the protocol. 

The function f() is specified as if a TTP was avail- 
able. MPC guarantees that no information is leaked from 
the computation process. However, just learning the resulting
value $f()$ could allow sensitive information to be
inferred. For example, if the input bit of all input peers must
remain secret, computing the logical AND of all input
bits is insecure in itself: if the final result was 1, all input
bits must be 1 as well and are thus no longer secret.
It is the responsibility of the input peers to verify that 
learning f() is acceptable, in the same way as they have 
to verify this when using a real TTP. For example, we 
assume input peers are not willing to reconstruct item 
distributions but consider it safe to compute the overall 
item count or entropy. To reduce the potential for de- 
ducing information from f(), protocols can enforce the 
submission of “valid” input data conforming to certain 
rules. For instance, in our event correlation protocol, the 
privacy peers verify that each input peer submits no du- 
plicate events. More formally, the work on differential 
privacy [17] systematically randomizes the output f() of 
database queries to prevent inference of sensitive input 
data. 

Prior to running the protocols, the m privacy peers set 
up a secure, i.e., confidential and authentic, channel to
each other. In addition, each input peer creates a secure 
channel to each privacy peer. We assume that the re- 
quired public keys and/or certificates have been securely 
distributed beforehand. 
Privacy-Performance Tradeoff Although the number
of privacy peers m has a quadratic impact on the total 
communication and computation costs, there are also m 
privacy peers sharing the load. That is, if the network ca- 
pacity is sufficient, the overall running time of the proto- 
cols will scale linearly with m rather than quadratically. 
On the other hand, the number of tolerated colluding pri- 
vacy peers also scales linearly with m. Hence, the choice 
of $m$ involves a privacy-performance tradeoff. The separation
of roles into input and privacy peers allows this
tradeoff to be tuned independently of the number of input
providers.


4.1 Event Correlation 


The first protocol we present enables the input peers to 
privately aggregate arbitrary network events. An event e 
is defined by a key-weight pair e = (k,w). This no- 
tion is generic in the sense that keys can be defined to 
represent arbitrary types of network events, which are 
uniquely identifiable. The key $k$ could for instance be
the source IP address of packets triggering IDS alerts, 
or the source address concatenated with a specific alert 
type or port number. It could also be the hash value of 
extracted malicious payload or represent a uniquely iden- 
tifiable object, such as popular URLs, of which the in- 
put peers want to compute the total number of hits. The 
weight w reflects the impact (count) of this event (ob- 
ject), e.g., the frequency of the event in the current time 
window or a classification on a severity scale. 

Each input peer shares at most s local events per time 
window. The goal of the protocol is to reconstruct an 
event if and only if a minimum number of input peers 
$T_c$ report the same event and the aggregated weight is at
least $T_w$. The rationale behind this definition is that an
input peer does not want to reconstruct local events that 
are unique in the set of all input peers, exposing sensitive 
information asymmetrically. But if the input peer knew 
that, for example, three other input peers report the same 
event, e.g., a specific intrusion alert, he would be willing 
to contribute his information and collaborate. Likewise, 
an input peer might only be interested in reconstructing 
events of a certain impact, having a non-negligible ag- 
gregated weight. 

More formally, let $[e_{ij}] = ([k_{ij}], [w_{ij}])$ be the
shared event $j$ of input peer $i$ with $j \le s$ and
$i \le n$. Then we compute the aggregated count $C_{ij}$ and
weight $W_{ij}$ according to (2) and (3) and reconstruct
$e_{ij}$ iff (4) holds.

$$[C_{ij}] = \sum_{i' \ne i,\, j'} \mathrm{equal}([k_{ij}], [k_{i'j'}]) \tag{2}$$

$$[W_{ij}] = \sum_{i' \ne i,\, j'} [w_{i'j'}] \cdot \mathrm{equal}([k_{ij}], [k_{i'j'}]) \tag{3}$$

$$([C_{ij}] \ge T_c) \wedge ([W_{ij}] \ge T_w) \tag{4}$$

Reconstruction of an event $e_{ij}$ includes the reconstruction
of $k_{ij}$, $C_{ij}$, $W_{ij}$, and the list of input peers reporting
it, but the $w_{ij}$ remain secret. The detailed algorithm is
given in Fig. 2.
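
As a plaintext reference for what conditions (2) to (4) decide, the following sketch evaluates the aggregated count and weight in the clear. It is not the MPC protocol: the `equal` operation becomes an ordinary comparison and nothing is secret-shared. The class and method names are hypothetical, and the sums are read as running over the events of the other input peers ($i' \ne i$).

```java
/** Plaintext reference for the event correlation condition (no secret sharing). */
class EventCorrelationSketch {

    /**
     * Checks condition (4) for event j of input peer i.
     * keys[p][e] and weights[p][e] hold key and weight of event e at peer p.
     */
    static boolean meetsThresholds(long[][] keys, long[][] weights,
                                   int i, int j, long tc, long tw) {
        long count = 0;   // C_ij as in (2)
        long weight = 0;  // W_ij as in (3)
        for (int p = 0; p < keys.length; p++) {
            if (p == i) {
                continue; // assumption: sums run over the other input peers only
            }
            for (int e = 0; e < keys[p].length; e++) {
                long equal = (keys[i][j] == keys[p][e]) ? 1 : 0;
                count += equal;
                weight += weights[p][e] * equal;
            }
        }
        return count >= tc && weight >= tw;
    }

    public static void main(String[] args) {
        long[][] keys = {{10, 20}, {10, 30}, {10, 40}, {50, 20}};
        long[][] weights = {{1, 5}, {2, 1}, {3, 1}, {1, 4}};
        // Key 10 is also reported by peers 1 and 2; key 20 only by peer 3.
        System.out.println(meetsThresholds(keys, weights, 0, 0, 2, 5));
        System.out.println(meetsThresholds(keys, weights, 0, 1, 2, 5));
    }
}
```

In the actual protocol the same sums are computed on shares, so only events passing both thresholds are ever revealed.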


Input Verification In addition to merely implementing 
the correlation logic, we devise two optional input ver- 
ification steps. In particular, the PPs check that shared 
weights are below a maximum weight $w_{max}$ and that
each input peer shares distinct events. These verifica- 
tions are not needed to secure the computation process, 
but they serve two purposes. First, they protect from mis- 
configured input peers and flawed input data. Secondly, 
they protect against input peers that try to deduce infor- 
mation from the final computation result. For instance, 
an input peer could add an event $T_c - 1$ times (with a total
weight of at least $T_w$) to find out whether any other input
peers report the same event. These input verifications
mitigate such attacks. 


Probe Response Attacks If aggregated security events 
are made publicly available, this enables probe response 
attacks against the system [5]. The goal of probe re- 
sponse attacks is not to learn private input data but 
to identify the sensors of a distributed monitoring sys- 
tem. To remain undiscovered, attackers then exclude 
the known sensors from future attacks against the sys- 
tem. While defending against this in general is an in- 
tractable problem, [41] identified that the suppression of 
low-density attacks provides some protection against basic
probe response attacks. Filtering out low-density attacks
in our system can be achieved by setting the thresholds
$T_c$ and $T_w$ sufficiently high.


Complexity The overall complexity, including verifica- 
tion steps, is summarized below in terms of operation 
invocations and rounds: 


equal:            $O((n - T_c)\, n s^2)$
lessThan:         $(2n - T_c)\, s$
shortRange:       $(n - T_c)\, s$
multiplications:  $(n - T_c) \cdot (n s^2 + s)$
rounds:           $7l + \log_2(n - T_c) + 26$


The protocol is clearly dominated by the number of
equal operations required for the aggregation step. It
scales quadratically with $s$; depending on $T_c$, however,
it scales linearly or quadratically with $n$. For instance,
if $T_c$ has a constant offset to $n$ (e.g., $T_c = n - 4$), only
$O(ns^2)$ equals are required. However, if $T_c = n/2$,
$O(n^2 s^2)$ equals are necessary.
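
The two regimes can be illustrated with a back-of-the-envelope count. The sketch below assumes, following the aggregation step of Fig. 2, that the events of the first $\min(n - T_c + 1, n)$ input peers are each compared against all events of the other $n - 1$ peers. It ignores the optional verification steps, and the class and method names are hypothetical.

```java
/** Rough count of equal operations in the aggregation step (illustrative only). */
class EqualCountSketch {

    static long aggregationEquals(int n, int s, int tc) {
        // Assumption: only events of the first min(n - T_c + 1, n) peers are checked.
        long candidatePeers = Math.min(n - tc + 1, n);
        // Each of their s events is compared with the s events of n - 1 other peers.
        return candidatePeers * s * (long) (n - 1) * s;
    }

    public static void main(String[] args) {
        int n = 25;
        int s = 30;
        System.out.println("T_c = n - 4: " + aggregationEquals(n, s, n - 4));
        System.out.println("T_c = n / 2: " + aggregationEquals(n, s, n / 2));
    }
}
```

With a constant offset the first factor stays fixed as $n$ grows, which is where the linear dependence on $n$ comes from.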


1. Share Generation: Each input peer $i$ shares $s$ distinct events $e_{ij}$ with $w_{ij} < w_{max}$ among the privacy peers (PPs).
2. Weight Verification: Optionally, the PPs compute and reconstruct lessThan($[w_{ij}]$, $w_{max}$) for all weights to verify that
   they are smaller than $w_{max}$. Misbehaving input peers are disqualified.
3. Key Verification: Optionally, the PPs verify that each input peer $i$ reports distinct events, i.e., for each event index $a$ and $b$
   with $a < b$ they compute and reconstruct equal($[k_{ia}]$, $[k_{ib}]$). Misbehaving input peers are disqualified.
4. Aggregation: The PPs compute $[C_{ij}]$ and $[W_{ij}]$ according to (2) and (3) for $i \le \hat{\imath}$ with $\hat{\imath} = \min(n - T_c + 1, n)$.* All
   required equal operations can be performed in parallel.
5. Reconstruction: For each event $[e_{ij}]$, with $i \le \hat{\imath}$, condition (4) has to be checked. Therefore, the PPs compute

   $[t_1]$ = shortRange($[C_{ij}]$, $T_c$, $n$),

   $[t_2]$ = lessThan($T_w - 1$, $[W_{ij}]$)

   Then, the event is reconstructed iff $[t_1] \cdot [t_2]$ returns 1. The set of input peers with $i > \hat{\imath}$ reporting a reconstructed event
   $r = (k, W)$ is computed by reusing all the equal operations performed on $r$ in the aggregation step. That is, input peer $i'$
   reports $r$ iff $\sum_j$ equal($[k]$, $[k_{i'j}]$) equals 1. This can be computed using local addition for each remaining input peer and
   each reconstructed event. Finally, all reconstructed events are sent to all input peers.

Figure 2: Algorithm for event correlation protocol.


1. Share Generation: Each input peer $i$ shares its input
   vector $d_i = (x_1, x_2, \ldots, x_r)$ among the PPs.
   That is, the PPs obtain $n$ vectors of sharings
   $[d_i] = ([x_1], [x_2], \ldots, [x_r])$.
2. Summation: The PPs compute the sum $[D] = \sum_{i=1}^{n} [d_i]$.
3. Reconstruction: The PPs reconstruct all elements of
   $D$ and send them to all input peers.

Figure 3: Algorithm for vector addition protocol.


Optimizations To avoid the quadratic dependency on $s$,
we are working on an MPC version of a binary search
algorithm that finds a secret $[a]$ in a sorted list of
secrets $\{[b_1], \ldots, [b_s]\}$ with $\log_2 s$ comparisons by
comparing $[a]$ to the element in the middle of the list, here
called $[b_*]$. We then construct a new list, being the
first or second half of the original list, depending on
lessThan($[a]$, $[b_*]$). The procedure is repeated
recursively until the list has size 1. This allows us to compare
all events of two input peers with only $O(s \log_2 s)$
instead of $O(s^2)$ comparisons. To further reduce the
number of equal operations, the protocol can be adapted to
receive incremental updates from input peers. That is,
input peers submit a list of events in each time window and
inform the PPs which event entries have a different key
from the previous window. Then, only comparisons of
updated keys have to be performed and overall complexity
is reduced to $O(u(n - T_c)s)$, where $u$ is the number
of changed keys in that window. This requires, of course,
that information on input set dynamics is not considered
private.
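
The halving procedure of the binary search optimization in Section 4.1 can be sketched on plaintext values as follows. This is only an illustration of the control flow and of the number of comparisons: in the MPC version the values would be secret-shared and `lessThan` would be the distributed operation. The class name, the call counter, and the final equality test on the remaining element are assumptions made for this sketch.

```java
/** Plaintext sketch of the halving search; counts lessThan invocations. */
class BinarySearchSketch {

    static int lessThanCalls = 0;

    static boolean lessThan(long a, long b) {
        lessThanCalls++;
        return a < b;
    }

    /** Returns true iff a occurs in the ascending sorted array b. */
    static boolean contains(long a, long[] b) {
        int lo = 0;
        int hi = b.length; // candidates are b[lo] .. b[hi - 1]
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (lessThan(a, b[mid])) {
                hi = mid; // keep the first half
            } else {
                lo = mid; // keep the second half
            }
        }
        // A single equality test on the remaining element.
        return hi > lo && b[lo] == a;
    }

    public static void main(String[] args) {
        long[] sorted = {2, 4, 6, 8, 10, 12, 14, 16};
        System.out.println(contains(10, sorted) + " after " + lessThanCalls + " lessThan calls");
    }
}
```

Searching each of the $s$ keys of one peer in the sorted list of another peer this way gives the $O(s \log_2 s)$ comparison count mentioned in the text.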


4.2 Network Traffic Statistics 


In this section, we present protocols for the computation
of multi-domain traffic statistics including the
aggregation of additive traffic metrics, the computation of
feature entropy, and the computation of distinct item
count. These statistics find various applications in
network monitoring and management.

1. Share Generation: Each input peer holds an $r$-dimensional
   private input vector $s^i \in \mathbb{Z}_p^r$ representing
   the local item histogram, where $r$ is the number of items
   and $s_k^i$ is the count for item $k$. The input peers share all
   elements of their $s^i$ among the PPs.
2. Summation: The PPs compute the item counts
   $[s_k] = \sum_{i=1}^{n} [s_k^i]$. Also, the total count
   $[S] = \sum_{k=1}^{r} [s_k]$ is computed and reconstructed.
3. Exponentiation: The PPs compute $[(s_k)^q]$ using
   square-and-multiply.
4. Entropy Computation: The PPs compute the sum
   $\sigma = \sum_k [(s_k)^q]$ and reconstruct $\sigma$. Finally, at least
   one PP uses $\sigma$ to (locally) compute the Tsallis entropy
   $H_q(Y) = \frac{1}{q-1}(1 - \sigma/S^q)$.

Figure 4: Algorithm for entropy protocol.


4.2.1 Vector Addition 


To support basic additive functionality on timeseries and 
histograms, we implement a vector addition protocol. 
Each input peer $i$ holds a private $r$-dimensional input
vector $d_i \in \mathbb{Z}_p^r$. Then, the vector addition protocol
computes the sum $D = \sum_{i=1}^{n} d_i$. We briefly describe the
corresponding SEPIA protocol in Fig. 3. This protocol
requires no distributed multiplications and only one
round.
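
The reason vector addition needs no distributed multiplications is that Shamir shares are additively homomorphic: each privacy peer can add the shares it holds locally. The following self-contained sketch illustrates this. It is not the SEPIA implementation; the class name, the field prime $2^{31} - 1$, and the threshold parameter `t` are assumptions chosen for the example.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

/** Illustrative Shamir sharing with local addition of shares. */
class ShamirAdditionSketch {
    static final long P = 2147483647L; // 2^31 - 1, a prime (assumed field)
    static final SecureRandom RNG = new SecureRandom();

    /** Shares a secret among m peers using a random polynomial of degree t. */
    static long[] share(long secret, int m, int t) {
        long[] coeff = new long[t + 1];
        coeff[0] = Math.floorMod(secret, P);
        for (int d = 1; d <= t; d++) {
            coeff[d] = Math.floorMod(RNG.nextLong(), P);
        }
        long[] shares = new long[m];
        for (int x = 1; x <= m; x++) {
            long y = 0;
            for (int d = t; d >= 0; d--) {
                y = (y * x + coeff[d]) % P; // Horner evaluation at point x
            }
            shares[x - 1] = y;
        }
        return shares;
    }

    /** Lagrange interpolation at 0 from the shares of peers 1..m. */
    static long reconstruct(long[] shares) {
        int m = shares.length;
        long secret = 0;
        for (int i = 1; i <= m; i++) {
            long num = 1;
            long den = 1;
            for (int j = 1; j <= m; j++) {
                if (j == i) continue;
                num = num * Math.floorMod(-j, P) % P;
                den = den * Math.floorMod(i - j, P) % P;
            }
            long inv = BigInteger.valueOf(den).modInverse(BigInteger.valueOf(P)).longValue();
            secret = (secret + shares[i - 1] * (num * inv % P)) % P;
        }
        return secret;
    }

    /** Computes D = sum of all input vectors using only shares. */
    static long[] addVectors(long[][] inputs, int m, int t) {
        int r = inputs[0].length;
        long[][] ppSums = new long[m][r]; // ppSums[p][k]: share of D_k held by PP p
        for (long[] d : inputs) {
            for (int k = 0; k < r; k++) {
                long[] s = share(d[k], m, t);
                for (int p = 0; p < m; p++) {
                    ppSums[p][k] = (ppSums[p][k] + s[p]) % P; // local addition
                }
            }
        }
        long[] result = new long[r];
        for (int k = 0; k < r; k++) {
            long[] column = new long[m];
            for (int p = 0; p < m; p++) column[p] = ppSums[p][k];
            result[k] = reconstruct(column);
        }
        return result;
    }

    public static void main(String[] args) {
        long[][] inputs = {{1, 2, 3}, {10, 20, 30}, {100, 200, 300}};
        System.out.println(java.util.Arrays.toString(addVectors(inputs, 5, 2)));
    }
}
```

The only communication is the initial distribution of shares and the final reconstruction, which matches the single round stated above. Results are correct as long as the true sums stay below the field prime.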


4.2.2 Entropy Computation 


The computation of the entropy of feature distributions 
has been successfully applied in network anomaly detection,
e.g. [23, 9, 25, 50]. Commonly used feature distributions
are, for example, those of IP addresses, port numbers,
flow sizes or host degrees. The Shannon entropy of
a feature distribution $Y$ is $H(Y) = -\sum_k p_k \cdot \log_2(p_k)$,
where $p_k$ denotes the probability of an item $k$. If $Y$ is
a distribution of port numbers, $p_k$ is the probability of
port $k$ to appear in the traffic data. The number of flows
(or packets) containing item $k$ is divided by the overall
flow (packet) count to calculate $p_k$. Tsallis entropy is
a generalization of Shannon entropy that also finds
applications in anomaly detection [50, 46]. It has been
substantially studied with a rich bibliography available
in [47]. The 1-parametric Tsallis entropy is defined as:


$$H_q(Y) = \frac{1}{q-1}\Big(1 - \sum_k (p_k)^q\Big)$$


and has a direct interpretation in terms of moments of 
order q of the distribution. In particular, the Tsallis en- 
tropy is a generalized, non-extensive entropy that, up to 
a multiplicative constant, equals the Shannon entropy for 
$q \to 1$. For generality, we choose to design an MPC
protocol for the Tsallis entropy.


Entropy Protocol A straightforward approach to compute
entropy is to first find the overall feature distribution
$Y$ and then to compute the entropy of the distribution.
In particular, let $p_k$ be the overall probability of
item $k$ in the union of the private data and $s_k^i$ the local
count of item $k$ at input peer $i$. If $S$ is the total count of
the items, then $p_k = \frac{1}{S} \sum_{i=1}^{n} s_k^i$. Thus, to compute the
entropy, the input peers could simply use the addition
protocol to add all the $s_k^i$'s and find the probabilities $p_k$.
Each input peer could then compute $H(Y)$ locally. However,
the distribution $Y$ can still be very sensitive as it
contains information for each item, e.g., per address
prefix. For this reason, we aim at computing $H(Y)$ without
reconstructing any of the values $s_k^i$ or $p_k$. Because
the rational numbers $p_k$ cannot be shared directly over
a prime field, we perform the computation separately on
private numerators ($s_k^i$) and the public overall item count
$S$. The entropy protocol achieves this goal as described
in Fig. 4. It is assured that sensitive intermediate results
are not leaked and that input and privacy peers only learn
the final entropy value $H_q(Y)$ and the total count $S$. $S$
is not considered sensitive as it only represents the total
flow (or packet) count of all input peers together. This
can be easily computed by applying the addition protocol
to volume-based metrics. The complexity of this protocol
is $r \log_2 q$ multiplications in $\log_2 q$ rounds.
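
To make the split into numerators and the public total count concrete, the sketch below evaluates the steps of the entropy protocol on plaintext counts: summation, exponentiation by square-and-multiply, and the final local computation $H_q(Y) = \frac{1}{q-1}(1 - \sigma/S^q)$. It is a plaintext illustration only; in the protocol, the sums and powers are computed on shares and only $\sigma$ and $S$ are reconstructed. The class and method names are hypothetical, and the integer exponent $q \ge 2$ and small counts (no overflow handling) are assumptions.

```java
/** Plaintext evaluation of the entropy protocol steps (illustrative only). */
class TsallisEntropySketch {

    /** Square-and-multiply exponentiation for a non-negative integer exponent. */
    static long power(long base, int q) {
        long result = 1;
        while (q > 0) {
            if ((q & 1) == 1) {
                result *= base;
            }
            base *= base;
            q >>= 1;
        }
        return result;
    }

    /** localCounts[i][k] is the count of item k at input peer i; q must be >= 2. */
    static double tsallis(long[][] localCounts, int q) {
        int r = localCounts[0].length;
        long[] s = new long[r]; // aggregated item counts s_k
        long total = 0;         // total count S
        for (long[] peer : localCounts) {
            for (int k = 0; k < r; k++) {
                s[k] += peer[k];
                total += peer[k];
            }
        }
        long sigma = 0; // sum of (s_k)^q
        for (int k = 0; k < r; k++) {
            sigma += power(s[k], q);
        }
        return (1.0 / (q - 1)) * (1.0 - (double) sigma / power(total, q));
    }

    public static void main(String[] args) {
        long[][] counts = {{3, 1}, {1, 3}}; // two peers, two items
        System.out.println(tsallis(counts, 2));
    }
}
```

Dividing $\sigma$ by $S^q$ is equivalent to summing $(p_k)^q$, so no rational number ever has to be shared.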


4.2.3 Distinct Count 


In this section, we devise a simple distinct count protocol 
leaking no intermediate information. Let $s_k^i \in \{0,1\}$ be
a boolean variable equal to 1 if input peer $i$ sees item $k$
and 0 otherwise. We first compute the logical OR of the
boolean variables to find if an item was seen by any input
peer or not. Then, simply summing the number of
variables equal to 1 gives the distinct count of the items.
According to De Morgan's Theorem, $a \vee b = \neg(\neg a \wedge \neg b)$.

1. Share Generation: Each input peer $i$ shares its negated
   local counts $c_k^i = \neg s_k^i$ among the PPs.
2. Aggregation: For each item $k$, the PPs compute
   $[c_k] = [c_k^1] \wedge [c_k^2] \wedge \ldots \wedge [c_k^n]$. This can be done in $\log_2 n$ rounds.
   If an item $k$ is reported by any input peer, then $c_k$ is 0.
3. Counting: Finally, the PPs build the sum $[\sigma] = \sum [c_k]$
   over all items and reconstruct $\sigma$. The distinct count is
   then given by $K - \sigma$, where $K$ is the size of the item
   domain.

Figure 5: Algorithm for distinct count protocol.


This means the logical OR can be realized by performing 
a logical AND on the negated variables. This is conve- 
nient, as the logical AND is simply the product of two 
variables. Using this observation, we construct the pro- 
tocol described in Fig. 5. This protocol guarantees that 
only the distinct count is learned from the computation; 
the set of items is not reconstructed. However, if the in- 
put peers agree that the item set is not sensitive it can 
easily be reconstructed after step 2. The complexity of
this protocol is $(n - 1)r$ multiplications in $\log_2 n$ rounds.
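
The De Morgan construction can be checked on plaintext bits with a few lines of code. The sketch below multiplies the negated indicator bits per item (the logical AND), sums the products, and subtracts the sum from the domain size $K$. It illustrates the arithmetic only; in the protocol the products are distributed multiplications on shares. The class and method names are hypothetical.

```java
/** Plaintext illustration of the distinct count arithmetic. */
class DistinctCountSketch {

    /** seen[i][k] is 1 if input peer i saw item k, and 0 otherwise. */
    static long distinctCount(int[][] seen) {
        int domainSize = seen[0].length; // K
        long sigma = 0;
        for (int k = 0; k < domainSize; k++) {
            long c = 1;
            for (int[] peer : seen) {
                c *= (1 - peer[k]); // AND of negated bits as a product
            }
            sigma += c; // c is 1 only if no peer saw item k
        }
        return domainSize - sigma;
    }

    public static void main(String[] args) {
        int[][] seen = {{1, 0, 0, 1}, {0, 0, 1, 1}, {1, 0, 0, 0}};
        System.out.println(distinctCount(seen));
    }
}
```

Each item needs $n - 1$ multiplications, and arranging them as a binary tree is what allows the $\log_2 n$ rounds stated above.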


5 Performance Evaluation 


In this section we evaluate the event correlation protocol
and the protocols for network statistics. After that we
explore the impact of running selected protocols on Plan- 
etLab where hardware, network delay, and bandwidth 
are very heterogeneous. This section is concluded with 
a performance comparison between SEPIA and existing 
general-purpose MPC frameworks. 

We assessed the CPU and network bandwidth require- 
ments of our protocols, by running different aggregation 
tasks with real and simulated network data. For each 
protocol, we ran several experiments varying the most 
important parameters. We varied the number of input 
peers n between 5 and 25 and the number of privacy 
peers m between 3 and 9, with m < n. The experiments 
were conducted on a shared cluster comprised of sev- 
eral public workstations; each workstation was equipped 
with a 2x Pentium 4 CPU (3.2 GHz), 2 GB memory, and 
100 Mb/s network. Each input and privacy peer was run 
on a separate host. In our plots, each data point reflects 
the average over 10 time windows. Background load due 
to user activity could not be totally avoided. Section 5.3 
discusses the impact of single slow hosts on the overall 
running time. 


5.1 Event Correlation 


For the evaluation of the event correlation protocol,
we generated artificial event data. It is important to note
that our performance metrics do not depend on the actual
values used in the computation, hence artificial data is
just as good as real data for these purposes.

Figure 6: Round statistics for event correlation with $T_c = n/2$. $s$ is the number of events per input peer.


Running Time Fig. 6 shows evaluation results for event 
correlation with s = 30 events per input peer, each with 
24-bit keys for $T_c = n/2$. We ran the protocol including
weight and key verification. Fig. 6a shows that
the average running time per time window always stays 
below 3.5 min and scales quadratically with n, as ex- 
pected. Investigation of CPU statistics shows that with 
increasing n also the average CPU load per privacy peer 
grows. Thus, as long as CPUs are not used to capacity, 
local parallelization manages to compensate parts of the
quadratic increase. With $T_c = n - \mathrm{const}$, the running
time as well as the number of operations scale linearly 
with n. Although the total communication cost grows 
quadratically with m, the running time dependence on 
m is rather linear, as long as the network is not satu- 
rated. The dependence on the number of events per input 
peer s is quadratic as expected without optimizations (see 
Fig. 6c). 

To study whether privacy peers spend most of their 
time waiting due to synchronization, we measured the 
user and system time of their hosts. All the privacy peers 
were constantly busy with average CPU loads between 
120% and 200% for the various operations.*> Communi- 
cation and computation between PPs is implemented us- 
ing separate threads to minimize the impact of synchro- 
nization on the overall running time. Thus, SEPIA profits 
from multi-core machines. Average load decreases with 
increasing need for synchronization from multiplications 
to equal, over lessThan to event correlation. Nevertheless,
even with event correlation, processors are very
busy and not stalled by the network layer. 


Bandwidth requirements Besides running time, the 
communication overhead imposed on the network is an 
important performance measure. Since data volume is 
dominated by privacy peer messages, we show the average
bytes sent per privacy peer in one time window
in Fig. 6b. Similar to running time, data volume scales
roughly quadratically with $n$ and linearly with $m$. In
addition to the transmitted data, each privacy peer
receives about the same amount of data from the other
input and privacy peers. If we assume a 5-minute clocking
of the event correlation protocol, an average bandwidth
between 0.4 Mbps (for $n = 5$, $m = 3$) and 13 Mbps
(for $n = 25$, $m = 9$) is needed per privacy peer. Assuming
a 5-minute interval and sufficient CPU/bandwidth
resources, the maximum number of supported input peers
before the system stops working in real-time ranges from
around 30 up to roughly 100, depending on protocol
parameters.


5.2 Network Statistics


For evaluating the network statistics protocols, we 
used unsampled NetFlow data captured from the five 
border routers of the Swiss academic and research net- 
work (SWITCH), a medium-sized backbone operator, 
connecting approximately 40 governmental institutions, 
universities, and research labs to the Internet. We first 
extracted traffic flows belonging to different customers 
of SWITCH and assigned an independent input peer to 
each organization’s trace. For each organization, we then 
generated SEPIA input files, where each input field con- 
tained either the values of volume metrics to be added or 
the local histogram of feature distributions for collabora- 
tive entropy (distinct count) calculation. In this section 
we focus on the running time and bandwidth require- 
ments only. We performed the following tasks over ten 
5-minute windows: 


1. Volume Metrics: Adding 21 volume metrics con- 
taining flow, packet, and byte counts, both total and 
separately filtered by protocol (TCP, UDP, ICMP) 
and direction (incoming, outgoing). For example, 
Fig. 10 in Section 7.2 plots the total and local num- 
ber of incoming UDP flows of six organizations for 
an 11-day period.

Figure 7: Network statistics: avg. running time per time window versus n and m, measured on a department-wide
cluster. All tasks were run with an input set size of 65k items.


2. Port Histogram: Adding the full destination port 
histogram for incoming UDP flows. SEPIA input 
files contained 65,535 fields, each indicating the 
number of flows observed to the corresponding port. 
These local histograms were aggregated using the 
addition protocol. 

3. Port Entropy: Computing the Tsallis entropy of
destination ports for incoming UDP flows. The local
SEPIA input files contained the same information
as for histogram aggregation. The Tsallis exponent
$q$ was set to 2.

4. Distinct count of AS numbers: Aggregating the 
count of distinct source AS numbers in incom- 
ing UDP traffic. The input files contained 65,535 
columns, each denoting if the corresponding source 
AS number was observed. For this setting, we re- 
duced the field size p to 31 bits because the expected 
size of intermediate values is much smaller than for 
the other tasks. 


Running Time For task 1, the average running time was 
below 1.6s per time window for all configurations, even 
with 25 input and 9 privacy peers. This confirms that 
addition-only is very efficient for low volume input data. 
Fig. 7 summarizes the running time for tasks 2 to 4. The 
plots show on the y-axes the average running time per 
time window versus the number of input peers on the x- 
axes. In all cases, the running time for processing one 
time window was below 1.5 minutes. The running time 
clearly scales linearly with n. Assuming a 5-minute in- 
terval, we can estimate by extrapolation the maximum 
number of supported input peers before the system stops 
working in real-time. For the conservative case with 9 
privacy peers, the supported number of input peers is ap- 
proximately 140 for histogram addition, 110 for entropy 
computation, and 75 for distinct count computation. We
observe that for single-round protocols (addition and
entropy), the number of privacy peers has only little impact
on the running time. For the distinct count protocol, the
running time increases linearly with both $n$ and $m$. Note
that the shortest running time for distinct count is even 
lower than for histogram addition. This is due to the 
reduced field size (p with 31 bits instead of 62), which 
reduces both CPU and network load. 


Bandwidth Requirements For all tasks, the data volume
sent per privacy peer scales perfectly linearly with $n$
and $m$. Therefore, we only report the maximum volume
with 25 input and 9 privacy peers. For addition of volume
metrics, the data volume is 141 KB and increases to
4.7 MB for histogram addition. Entropy computation
requires 8.5 MB and finally the multi-round distinct count
requires 50.5 MB. For distinct count, to transfer the total
of $2 \cdot 50.5 = 101$ MB within 5 minutes, an average
bandwidth of roughly 2.7 Mbps is needed per privacy peer.


5.3 Internet-wide Experiments


In our evaluation setting hosts have homogeneous 
CPUs, network bandwidth and low round trip times 
(RTT). In practice, however, SEPIA’s goal is to aggregate 
traffic from remote network domains, possibly resulting 
in a much more heterogeneous setting. For instance, high 
delay and low bandwidth directly affect the waiting time 
for messages. Once data has arrived, the CPU model and 
clock rate determine how fast the data is processed and 
can be distributed for the next round. 

Recall from Section 4 that each operation and pro- 
tocol in SEPIA is designed in rounds. Communication 
and computation during each round run in parallel. But 
before the next round can start, the privacy peers have 
to synchronize intermediate results and therefore wait 
for the slowest privacy peer to finish. The overall run- 
ning time of SEPIA protocols is thus affected by the 
slowest CPU, the highest delay, and the lowest band- 
width rather than by the average performance of hosts 
and links. Therefore we were interested to see whether 
the performance of our protocols breaks down if we take 
it out of the homogeneous LAN setting. Hence, we ran
SEPIA on PlanetLab [31] and repeated task 4 (distinct
AS count) with 10 input and 5 privacy peers on globally
distributed PlanetLab nodes. Table 1 compares the LAN
setup with two PlanetLab setups A and B.

               LAN         PlanetLab A    PlanetLab B
Max. RTT       1 ms        320 ms         320 ms
Bandwidth      100 Mb/s    ≥ 100 Kb/s     ≥ 100 Kb/s
Slowest CPU    2 cores     2 cores        1 core
               3.2 GHz     2.4 GHz        1.8 GHz
Running time   25.0 s      36.8 s         110.4 s

Table 1: Comparison of LAN and PlanetLab settings.

Framework      SEPIA         VIFF          FairplayMP
Technique      Shamir sh.    Shamir sh.    Bool. circuits
Platform       Java          Python        Java
Multipl./s     82,730        326           1.6
Equals/s       2,070         2.4           2.3
LessThans/s    86            2.4           2.3

Table 2: Comparison of framework performance in operations per second with m = 5.

RTT was much higher and average bandwidth much 
lower on PlanetLab. The only difference between Plan- 
etLab A and B was the choice of some nodes with slower 
CPUs. Despite the very heterogeneous and globally dis- 
tributed setting, the distinct count protocol performed 
well, at least in PlanetLab A. Most importantly, it still met
our near real-time requirements. From PlanetLab A to B, 
running time went up by a factor of 3. However, this can 
largely be explained by the slower CPUs. The distinct 
count protocol consists of parallel multiplications, which 
make efficient use of the CPU and local addition, which 
is solely CPU-bound. Let us assume, for simplicity, that 
clock rates translate directly into MIPS. Then, computa- 
tional power in PlanetLab B is roughly 2.7 times lower 
than in PlanetLab A. Of course, the more rounds a pro- 
tocol has, the bigger is the impact of RTT. But in each 
round, the network delay is only a constant offset and 
can be amortized over the number of parallel operations 
performed per round. For many operations, CPU and 
bandwidth are the real bottlenecks. 

While aggregation in a heterogeneous environment 
is possible, SEPIA privacy peers should ideally be de- 
ployed on dedicated hardware, to reduce background 
load, and with similar CPU equipment, so that no single 
host slows down the entire process. 


5.4 Comparison with General-Purpose 
Frameworks 

In this section we compare the performance of basic
SEPIA operations to those of general-purpose frameworks
such as FairplayMP [3] and VIFF v0.7.1 [15]. Besides
performance, one aspect to consider is, of course,
usability. Whereas the SEPIA library currently only
provides an API to developers, FairplayMP allows protocols
to be written in a high-level language called SFDL and VIFF
integrates nicely into the Python language. Furthermore, 
VIFF implements asynchronous protocols and provides 
additional functionality, such as security against mali- 
cious adversaries and support of MPC based on homo- 
morphic cryptosystems. 

Tests were run on 2x Dual Core AMD Opteron 275 
machines with 1Gb/s LAN connections. To guarantee a 
fair comparison, we used the same settings for all frame- 
works. In particular, the semi-honest model, 5 computa- 
tion nodes, and 32 bit secrets were used. Unlike VIFF 
and SEPIA, which use an information-theoretically se- 
cure scheme, FairplayMP requires the choice of an ade- 
quate security parameter k. We set k = 80, as suggested 
by the authors in [3]. 

Table 2 shows the average number of parallel oper- 
ations per second for each framework. SEPIA clearly 
outperforms VIFF and FairplayMP for all operations and 
is thus much better suited when performance of parallel
operations is of main importance. As an example, a run
of event correlation taking 3 minutes with SEPIA would 
take roughly 2 days with VIFF. This extends the range 
of practically runnable MPC protocols significantly. No- 
tably, SEPIA’s equal operation is 24 times faster than 
its lessThan, which requires 24 times more multipli- 
cations, but at the same time also twice the number of 
rounds. This confirms that with many parallel opera- 
tions, the number of multiplications becomes the dom- 
inating factor. Approximately 3/4 of the time spent 
for lessThan is used for generating sharings of random 
numbers used in the protocol. These random sharings 
are independent from input data and could be generated 
prior to the actual computation, making it possible to perform 380
lessThans per second in the same setting.

Even for multiplications, SEPIA is faster than VIFF, 
although both rely on the same scheme. We assume this 
can largely be attributed to the completely asynchronous 
protocols implemented in VIFF. Whereas asynchronous 
protocols are very efficient for dealing with malicious 
adversaries, they make it impossible to reduce network 
overhead by exchanging intermediate results of all paral- 
lel operations at once in a single big message. Also, there 
seems to be a bottleneck in parallelizing large numbers 
of operations. In fact, when benchmarking VIFF, we no- 
ticed that after some point, adding more parallel opera- 
tions significantly slowed down the average running time 
per operation. 

Sharemind [6] is another interesting MPC framework 
using additive secret sharing to implement multiplica- 
tions and greater-or-equal (GTE) comparison. The au- 
thors implement it in C++ to maximize performance. 
However, the use of additive secret sharing makes the
implementations of basic operations dependent on the
number of computation nodes used. For this reason,
Sharemind is currently restricted to 3 computation nodes only.
Regarding performance, however, Sharemind is comparable
to SEPIA. According to [6], Sharemind performs
up to 160,000 multiplications and around 330 GTE
operations per second, with 3 computation nodes. With
3 PPs, SEPIA performs around 145,000 multiplications
and 145 lessThans per second (615 with pre-generated
randomness). Sharemind does not directly implement
equal, but it could be implemented using 2 invocations
of GTE, leading to ≈ 115 operations/s. SEPIA's equal
is clearly faster with up to 3,400 invocations/s. SEPIA
demonstrates that operations based on Shamir shares are 
not necessarily slower than operations in the additive 
sharing scheme. The key to performance is rather an im- 
plementation, which is optimized for a large number of 
parallel operations. Thus, SEPIA combines speed with 
the flexibility of Shamir shares, which support any num- 
ber of computation nodes and are to a certain degree ro- 
bust against node failures. 


6 Design and Implementation 


The foundation of the SEPIA library is an implemen- 
tation of the basic operations, such as multiplications 
and optimized comparisons (see Section 3), along with 
a communication layer for establishing SSL connections 
between input and privacy peers. In order to limit the 
impact of varying communication latencies and response 
times, each connection, along with the corresponding 
computation and communication tasks, is handled by a 
separate thread. This also implies that SEPIA proto- 
cols benefit from multi-core systems for computation- 
intensive tasks. In order to reduce synchronization over- 
head, intermediate results of parallel operations sent to 
the same destination are collected and transferred in a single
big message instead of many small messages. On top
of the basic layers, the protocols from Section 4 are im- 
plemented as standalone command-line (CLI) tools. The 
CLI tools expect a local configuration file containing pri- 
vacy peer addresses, paths to a folder with input data and 
a Java keystore, as well as protocol-dependent parame- 
ters. The tools write a log of the ongoing computation 
and output files with aggregate results for each time win- 
dow. The keystore holds certificates of trusted input and 
privacy peers to establish SSL connections. It is possible 
to delay the start of a computation until a minimum num- 
ber of input and privacy peers are online. This gives the 
input peers the ability to define an acceptable level of pri- 
vacy by only participating in the computation if a certain 
number of other input/privacy peers also participate. 

ShamirSharing sharing = new ShamirSharing();
sharing.setFieldPrime(1401085391); // 31 bit
sharing.setNrOfPrivacyPeers(nrOfPrivacyPeers);
sharing.init();

// Secrets: only a single value
long[] secrets = new long[]{1234567};
long[][] shares = sharing.generateShares(secrets);

// Send shares to each privacy peer
for (int i = 0; i < nrOfPrivacyPeers; i++) {
    connection[i].sendMessage(shares[i]);
}

Figure 8: Example code for an input peer that shares a
secret, e.g., a millionaire sharing his amount of wealth.


SEPIA is written in Java to provide platform independence.
The source code of the basic library and the four
CLI tools is available under the LGPL license on the
SEPIA project web page [39]. The web page also provides
pre-configured examples for the CLI tools and a
user manual. The user manual describes usage and
configuration of the CLI tools and includes a step-by-step
tutorial on how to use the library API to develop new
protocols. In the library API, all operations and
sub-protocols implement a common interface IOperation
and are easily composable. The class ProtocolPrimitives
makes it possible to schedule operations and takes
care of performing them in parallel, keeping track of
operation states. A base class for privacy peers implements
the doOperations() method, which runs all
the necessary computation rounds and synchronizes data
between privacy peers in each round. Fig. 8 shows example
code for input peers that want to privately compare
their secrets. First, each input peer generates shares of 
its secret. The shares are then sent to the PPs, for which 
example code is shown in Fig. 9. The PPs first schedule 
and execute lessThan comparisons for all combinations 
of input secrets. In a second step, they run the recon- 
struction operations and output the results. 


Future Work Note that with Shamir shares, reconstruction
of results is assured as long as $t + 1$ PPs are online
and responsive. This can be used directly to extend
SEPIA protocols with robustness against node failures. 
Also, weak nodes slowing down the entire computation 
could be excluded from the computation. We leave this 
as a future extension. 
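
The t + 1 reconstruction property noted above can be illustrated with a minimal Shamir sharing sketch. This is not SEPIA's ShamirSharing class: the class and method names below are hypothetical, and the only value taken from the paper is the field prime used in Fig. 8. A secret is hidden in the constant term of a random polynomial of degree t, and any t + 1 shares recover it by Lagrange interpolation at x = 0.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Illustrative sketch only; names are hypothetical and not part of the SEPIA API.
final class ShamirSketch {
    // Field prime taken from the Fig. 8 listing (31 bit).
    static final BigInteger P = BigInteger.valueOf(1401085391L);

    // Creates m shares of `secret` using a random polynomial of degree t.
    // Share i is the polynomial evaluated at x = i + 1.
    static long[] share(long secret, int t, int m, SecureRandom rnd) {
        BigInteger[] coeff = new BigInteger[t + 1];
        coeff[0] = BigInteger.valueOf(secret).mod(P);
        for (int k = 1; k <= t; k++) {
            coeff[k] = new BigInteger(P.bitLength() + 8, rnd).mod(P);
        }
        long[] shares = new long[m];
        for (int i = 0; i < m; i++) {
            BigInteger x = BigInteger.valueOf(i + 1);
            BigInteger y = BigInteger.ZERO;
            for (int k = t; k >= 0; k--) { // Horner's rule
                y = y.multiply(x).add(coeff[k]).mod(P);
            }
            shares[i] = y.longValueExact();
        }
        return shares;
    }

    // Lagrange interpolation at x = 0. xs holds the evaluation points of the
    // responsive peers and ys their shares; at least t + 1 of them are needed.
    static long reconstruct(int[] xs, long[] ys) {
        BigInteger secret = BigInteger.ZERO;
        for (int i = 0; i < xs.length; i++) {
            BigInteger num = BigInteger.ONE;
            BigInteger den = BigInteger.ONE;
            for (int j = 0; j < xs.length; j++) {
                if (i == j) continue;
                num = num.multiply(BigInteger.valueOf(-xs[j])).mod(P);
                den = den.multiply(BigInteger.valueOf(xs[i] - xs[j])).mod(P);
            }
            BigInteger term = BigInteger.valueOf(ys[i])
                    .multiply(num).multiply(den.modInverse(P));
            secret = secret.add(term).mod(P);
        }
        return secret.longValueExact();
    }
}
```

With t = 2 and five privacy peers, for example, the shares held by any three peers suffice, so two peers may be offline without affecting reconstruction.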

The protocols support any number of input and pri- 
vacy peers. Also, the item set sizes/events per input peer 
are configurable and thus only limited by the available 
CPU/bandwidth resources. However, running the network
statistics protocols (e.g., distinct count) on very
large distributions, such as the global IP address range,
requires the use of sketches as proposed in [37] or binning
(e.g., using address prefixes instead of addresses). As an
example, we have recently used sketches in combination
with SEPIA to efficiently compute top-k reports for distributed
IP address distributions with up to 180,000 distinct
addresses [10].

... // receive all the shares from input peers
ProtocolPrimitives primitives = new ProtocolPrimitives(fieldPrime, ...);

// Schedule comparisons of all the input peers' secrets
int id1=1, id2=2, id3=3; // consecutive operation IDs
primitives.lessThan(id1, new long[]{shareOfSecret1, shareOfSecret2});
primitives.lessThan(id2, new long[]{shareOfSecret2, shareOfSecret3});
primitives.lessThan(id3, new long[]{shareOfSecret1, shareOfSecret3});
doOperations(); // Process operations and synchronize intermediate results

// Get shares of the comparison results
long shareOfLessThan12 = primitives.getResult(id1);
long shareOfLessThan23 = primitives.getResult(id2);
long shareOfLessThan13 = primitives.getResult(id3);

// Schedule and perform reconstruction of comparisons
primitives.reconstruct(id1, new long[]{shareOfLessThan12});
primitives.reconstruct(id2, new long[]{shareOfLessThan23});
primitives.reconstruct(id3, new long[]{shareOfLessThan13});
doOperations();

boolean secret1_lessThan_secret2 = (primitives.getResult(id1) == 1);
boolean secret2_lessThan_secret3 = (primitives.getResult(id2) == 1);
boolean secret1_lessThan_secret3 = (primitives.getResult(id3) == 1);

Figure 9: Example code for a PP receiving shares of secrets from 3 input peers. It then compares the secrets privately,
e.g., to find which of the millionaires is the richest.

As part of future work, we also plan to investigate 
the applicability of polynomial set representation to our 
statistics protocols, to reduce the linear dependency on 
the input set domain. Polynomial set representation, in- 
troduced by Freedman et al. [18] and extended by Kiss- 
ner et al. [22], represents set elements as roots of a poly- 
nomial and enables set operations that scale only loga- 
rithmically with input domain size. However, these solu- 
tions use homomorphic public-key cryptosystems, which 
come with significant overhead for basic operations. Furthermore,
they do not trivially allow roles to be separated
into input and privacy peers, as each input provider is required
to perform certain non-delegable processing steps
on its own data.


7 Applications 


We envision four distinct aggregation scenarios us- 
ing SEPIA. The first scenario is aggregating informa- 
tion coming from multiple domains of one large (inter- 
national) organization. This aggregation is presently not 
always possible due to privacy concerns and heteroge- 
neous jurisdiction. The second scenario is analyzing pri- 
vate data owned by independent organizations with a mu- 
tual benefit in collaborating. Local ISPs, for example, 
might collaborate to detect common attacks. A third sce- 
nario provides access to researchers for evaluating and 
validating traffic analysis or event correlation prototypes 
over multi-domain network data. For example, national
research, educational, and university networks could provide
SEPIA input and/or privacy peers that allow analyzing
local data according to submitted MPC scripts. The
final scenario is the privacy-preserving analysis
of end-user data, i.e., end-user workstations can use
SEPIA to collaboratively analyze and cross-correlate lo- 
cal data. 


7.1 Application Taxonomy 


Based on these scenarios, we see three different 
classes of possible SEPIA applications. 


Network Security In recent years, considerable research
efforts have focused on distributed data aggregation
and correlation systems for the identification and
mitigation of coordinated wide-scale attacks. In par- 
ticular, aggregation enables the (early) detection and 
characterization of attacks spanning multiple domains 
using data from IDSes, firewalls, and other possible 
sources [2, 16, 26, 49]. Recent studies [21] show that 
coordinated wide-scale attacks are prevalent: 20% of the 
studied malicious addresses and 40% of the IDS alerts 
accounted for coordinated wide-scale attacks. Furthermore,
strongly correlated groups profiting most from collaboration
have fewer than 10 members and are stable over
time, which makes them well suited to SEPIA protocols.

In order to counter such attacks, Yegneswaran et 
al. [49] presented DOMINO, a distributed IDS that en- 
ables collaboration among nodes. They evaluated the 
performance of DOMINO with a large set of IDS logs 
from over 1600 providers. Their analysis demonstrates 
the significant benefit that is obtained by correlating the
data from several distributed intrusion data sources. The
major issue faced by such correlation systems is the lack
of data privacy. In their work, Porras et al. survey existing
defense mechanisms and identify several remaining
research challenges [32]. Specifically, they point out the
need for efficient privacy-preserving data mining algo- 
rithms that enable traffic classification, signature extrac- 
tion, and propagation analysis. 


Profiling and Performance Analysis A second cate- 
gory of applications relates to traffic profiling and perfor- 
mance measurements. A global profile of traffic trends 
helps organizations to cross-correlate local traffic trends 
and identify changes. In [38] the authors estimate that 
50 of the top-degree ASes together cover approximately 
90% of global AS-paths. Hence, if large ASes col- 
laborate, the computation of global Internet statistics is 
within reach. One possible statistic is the total traffic volume
across a large number of networks. Such a statistic
could have helped, for example, during the dot-com bubble
of the late nineties [37], when the traffic growth rate was
overestimated by a factor of 10, easing the flow of venture
capital to Internet start-ups. In addition, performance-related
applications can benefit from an “on average”
view across multiple domains. Data from multiple do- 
mains can also help to locate a remote outage with higher 
confidence, and to trigger proper detour mechanisms. A 
number of additional MPC applications related to perfor- 
mance monitoring are discussed in [36]. 


Research Validation Many studies are obliged to avoid 
rigorous validation or have to re-use a small number of 
old traffic traces [13, 43]. This situation clearly under- 
mines the reliability of the derived results. In this con- 
text, SEPIA can be used to establish a privacy-preserving 
infrastructure for research validation purposes. For ex- 
ample, researchers could provide MPC scripts to SEPIA 
nodes running at universities and research institutes. 


7.2 Case Study: The Skype Outage 


The Skype outage in August 2007 started from a 
Windows update triggering a large number of system 
restarts. In response, Skype nodes scanned cached host
lists to find supernodes, causing a huge distributed scanning
event lasting two days [35]. We used NetFlow traces
of the actual up- and downstream traffic of the 17 biggest 
customers of the SWITCH network. The traces span 11 
days from the 11th to 22nd and include the Skype outage 
(on the 16th/17th) along with other smaller anomalies. 
We ran SEPIA’s total count, distinct count, and entropy 
protocols on these traces and investigated how the orga- 
nizations can benefit by correlating their local view with 
the aggregate view. 

Figure 10: Flow count in 5-minute windows with anomalies
for the biggest organizations and aggregate view (ALL).
Each organization sees its local and the aggregate traffic.

Figure 11: Correlation of local and global anomalies for
organizations ordered by size (1=biggest).

We first computed per-organization and aggregate
timeseries of the UDP flow count metric and applied a
simple detector to identify anomalies. For each timeseries,
we used the first 4 days to learn its mean μ and
standard deviation σ, defined the normal region to be
within μ ± 3σ, and detected anomalous time intervals. In
Fig. 10 we illustrate the local timeseries for the six largest
organizations and the aggregate timeseries. We rank
organizations in decreasing order of their average number of
daily flows and use their rank to identify them. In the
figure, we also mark the detected anomalous intervals.
Observe that in addition to the Skype outage, some organizations
detect other smaller anomalies that took place
during the 11-day period.
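
The detector described above can be sketched in a few lines. This is a minimal sketch, not the code used in the evaluation: the class name is hypothetical, and the use of the population standard deviation over the training prefix is an assumption, since the text does not say which estimator was used.

```java
// Illustrative sketch only; names and the choice of estimator are assumptions.
final class ThreeSigmaDetector {
    // Learns the mean and standard deviation from the first `train` samples
    // and flags every sample that lies outside mean +/- 3 * sd.
    static boolean[] detect(double[] series, int train) {
        if (train <= 0 || train > series.length) {
            throw new IllegalArgumentException("train must be in 1..series.length");
        }
        double sum = 0.0;
        for (int i = 0; i < train; i++) {
            sum += series[i];
        }
        double mean = sum / train;

        double squares = 0.0;
        for (int i = 0; i < train; i++) {
            double d = series[i] - mean;
            squares += d * d;
        }
        double sd = Math.sqrt(squares / train); // population standard deviation

        boolean[] anomalous = new boolean[series.length];
        for (int i = 0; i < series.length; i++) {
            anomalous[i] = Math.abs(series[i] - mean) > 3.0 * sd;
        }
        return anomalous;
    }
}
```

For 5-minute windows, the 4-day training period corresponds to the first 4 * 24 * 12 = 1152 samples of each timeseries.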


Anomaly Correlation Using the aggregate view, an or- 
ganization can find if a local anomaly is the result of 
a global event that may affect multiple organizations. 
Knowing the global or local nature of an anomaly is important
for steering further troubleshooting steps. Therefore,
we first investigate how the local and global anomalous
intervals correlate. For each organization, we compared
the local and aggregate anomalous intervals and
measured the total time an anomaly was present: 1) only
in the local view, 2) only in the aggregate view, and 3)
both in the local and aggregate views, i.e., the matching
anomalous intervals. Fig. 11 illustrates the corresponding
time fractions. We observe a rather small fraction,
i.e., on average 14.1%, of local-only anomalies. Such
anomalies lead administrators to search for local targeted 
attacks, misconfigured or compromised internal systems, 
misbehaving users, etc. In addition, we observe an aver- 
age of 20.3% matching anomalous windows. Knowing 
an anomaly is both local and global steers an affected 
organization to search for possible problems in popular 
services, in widely-used software, like Skype in this case, 
or in the upstream providers. A large fraction (65.6%) of 
anomalous windows is only visible in the global view. 
In addition, we observe significant variability in the patterns
of different organizations. In general, larger organizations
tend to have a larger fraction of matching anomalies,
as they contribute more to the aggregate view. Some
organizations are highly correlated with the global
view, e.g., organization 3, which notably contributes only
7.4% of the total traffic. Other organizations are barely
correlated, e.g., organizations 9 and 12, and organization
2 has no local anomalies at all.
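
The three time fractions can be computed from two per-window anomaly flags, as in the following sketch. The class name is hypothetical, and normalizing by the number of windows that are anomalous in at least one view is an assumption; it is consistent with the reported averages (14.1%, 65.6%, and 20.3%) summing to 100%.

```java
// Illustrative sketch only; names and the normalization are assumptions.
final class AnomalyOverlap {
    // Returns {localOnly, aggregateOnly, both} as fractions of the windows
    // that are anomalous in at least one of the two views.
    static double[] fractions(boolean[] local, boolean[] aggregate) {
        if (local.length != aggregate.length) {
            throw new IllegalArgumentException("views must cover the same windows");
        }
        int localOnly = 0;
        int aggregateOnly = 0;
        int both = 0;
        for (int i = 0; i < local.length; i++) {
            if (local[i] && aggregate[i]) {
                both++;
            } else if (local[i]) {
                localOnly++;
            } else if (aggregate[i]) {
                aggregateOnly++;
            }
        }
        int total = localOnly + aggregateOnly + both;
        if (total == 0) {
            return new double[] {0.0, 0.0, 0.0}; // no anomaly in either view
        }
        return new double[] {
            (double) localOnly / total,
            (double) aggregateOnly / total,
            (double) both / total
        };
    }
}
```

An organization without any local anomalies, such as organization 2, yields a local-only and a matching fraction of zero under this definition.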


Anomaly Troubleshooting We define relative anomaly 
size to be the ratio of the detection metric value during an 
anomalous interval over the detection threshold. Organi- 
zations 3 and 4 had relative anomaly sizes 11.7 and 18.8, 
which is significantly higher than the average of 2.6. Us- 
ing the average statistic, organizations can compare the 
relative impact of an attack. Organization 2, for instance, 
had anomaly size 0 and concludes that there was a large 
anomaly taking place but they were not affected. Most 
of the organizations conclude that they were indeed affected,
but less than average. Organizations 3 and 4,
however, need to investigate why the anomaly
was so disproportionately strong in their networks.


An investigation of the full port distribution and its 
entropy (plots omitted due to space limitations) shows 
that affected organizations see a sudden increase in scan- 
ning activity on specific high port numbers. Connections 
originate mainly from ports 80 and 443, i.e., the fallback
ports of Skype, and a series of high port numbers,
indicating an anomaly related to Skype. For organizations
3 and 4, some of the scanned high ports are extremely
prevalent, i.e., a single destination port accounts
for 93% of all flows at the peak rate. Moreover, most of 
the anomalous flows within organizations 3 and 4 are tar- 
geted at a single IP address and originate from thousands 
of distinct source addresses connecting repeatedly up to 
13 times per minute. These patterns indicate that the two 
organizations host popular supernodes, attracting a lot of 
traffic to specific ports. Other organizations mainly host 
client nodes and see uniform scanning, while organization
2 has banned Skype completely. Based on this analysis,
organizations can take appropriate measures to mitigate
the impact of the 2-day outage, like notifying users
or blocking specific port numbers.

Org #         3     5     6      7       13    17
lag [hours]   1.2   2.7   23.4   15.55   48    3.6

Table 3: Organizations profiting from an early anomaly
warning by aggregation.


Early-warning Finally, we investigate whether the ag- 
gregate view can be useful for building an early-warning 
system for global or large-scale anomalies. The Skype 
anomaly did not start concurrently in all locations, since 
the Windows update policy and reboot times were differ- 
ent across organizations. We measured the lag between 
the time the Skype anomaly was first observed in the aggregate
and local view of each organization. In Table 3
we list the organizations that had considerable lag, i.e.,
above an hour. Notably, one of the most affected organizations
(6) could have learned of the anomaly almost one
day ahead. However, as shown in Fig. 11, for organization
2 this would have been a false positive alarm. To
profit most from such an early warning system in practice,
the aggregate view should be annotated with additional
information, such as the number of organizations
or the type of services affected by the same anomaly.
In this context, our event correlation protocol is useful to 
decide whether similar anomaly signatures are observed 
in the participating networks. Anomaly signatures can be 
extracted automatically using actively researched tech- 
niques [8, 33]. 
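
The lag measurement can be sketched as the distance between the first anomalous window in the aggregate view and the first anomalous window in the local view. The class and method names are hypothetical, and the sketch assumes the flag arrays have already been restricted to the event of interest (here, the Skype anomaly).

```java
// Illustrative sketch only; names are hypothetical.
final class EarlyWarningLag {
    // Index of the first anomalous window, or -1 if there is none.
    static int firstAnomaly(boolean[] flags) {
        for (int i = 0; i < flags.length; i++) {
            if (flags[i]) {
                return i;
            }
        }
        return -1;
    }

    // Hours by which the aggregate view precedes the local view.
    // Returns NaN if the event is missing from either view.
    static double lagHours(boolean[] aggregate, boolean[] local, int windowMinutes) {
        int a = firstAnomaly(aggregate);
        int l = firstAnomaly(local);
        if (a < 0 || l < 0) {
            return Double.NaN;
        }
        return (l - a) * windowMinutes / 60.0;
    }
}
```

A NaN result corresponds to the case of organization 2: the aggregate view raises an alarm, but the event never appears locally.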


8 Related Work


Most related to our work, Roughan and Zhan [37] first 
proposed the use of MPC techniques for a number of 
applications relating to traffic measurements, including 
the estimation of global traffic volume and performance 
measurements [36]. In addition, the authors identified 
that MPC techniques can be combined with commonly- 
used traffic analysis methods and tools, such as time- 
series algorithms and sketch data structures. Our work is 
similar in spirit, yet it extends their work by introducing 
new MPC protocols for event correlation, entropy, and 
distinct count computation and by implementing these 
protocols in a ready-to-use library. 

Data correlation systems that provide strong privacy 
guarantees for the participants achieve data privacy by 
means of (partial) data sanitization based on Bloom filters
[44] or cryptographic functions [26, 24]. However,
data sanitization is in general not a lossless process and
therefore imposes an unavoidable tradeoff between data
privacy and data utility.

The work presented by Chow et al. [12] and Applebaum
et al. [1] avoids this tradeoff by means of cryptographic
data obfuscation. Chow et al. proposed a
two-party query computation model to perform privacy- 
preserving querying of distributed databases. In addi- 
tion to the databases, their solution comprises three en- 
tities: the randomizer, the computing engine, and the 
query frontend. Local answers to queries are random- 
ized by each database and the aggregate results are de- 
randomized at the frontend. Applebaum et al. present 
a semi-centralized solution for the collaboration among 
a large number of participants in which responsibility is 
divided between a proxy and a central database. In a 
first step the proxy obliviously blinds the clients’ input, 
consisting of a set of keyword/value pairs, and stores the 
blinded keywords along with the non-blinded values in 
the central database. On request, the database identifies 
the (blinded) keywords that have values satisfying some 
evaluation function and forwards the matching rows to 
the proxy, which then unblinds the respective keywords. 
Finally, the database publishes its non-blinded data for 
these keywords. As opposed to these approaches, SEPIA 
does not depend on two central entities but in general 
supports an arbitrary number of distributed privacy peers, 
is provably secure, and is more flexible with respect to the
functions that can be executed on the input data. The 
similarities and differences between our work and exist- 
ing general-purpose MPC frameworks are discussed in 
Sec. 5.4. 


9 Conclusion 


The aggregation of network security and monitoring 
data is crucial for a wide variety of tasks, including col- 
laborative network defense and cross-sectional Internet 
monitoring. Unfortunately, concerns regarding privacy 
prevent such collaboration from materializing. In this 
paper, we investigated the practical usefulness of solu- 
tions based on secure multiparty computation (MPC). 
For this purpose, we designed optimized MPC operations 
that run efficiently on voluminous input data. We im- 
plemented these operations in the SEPIA library along 
with a set of novel protocols for event correlation and 
for computing multi-domain network statistics, i.e., entropy
and distinct count. Our evaluation results clearly
demonstrate the efficiency and scalability of SEPIA in 
realistic settings. With COTS hardware, near real-time 
operation is practical even with 140 input providers and 
9 computation nodes. Furthermore, the basic operations 
of the SEPIA library are significantly faster than those 
of existing MPC frameworks and can be used as building
blocks for arbitrary protocols. We believe that our
work provides useful insights into the practical utility of
MPC and paves the way for new collaboration initiatives. 
Our future work includes improving SEPIA’s robustness 
against host failures, dealing with malicious adversaries, 
and further improving performance, using, for example, 
polynomial set representations. Furthermore, in collab- 
oration with a major systems management vendor, we 
have started a project that aims at incorporating MPC 
primitives into a mainstream traffic profiling product.

Notes

1 We define near real-time as the requirement of fully processing
an x-minute interval of traffic data in no longer than x minutes, where
x is typically a small constant. For our evaluation, we use 5-minute
windows.

2 For instance, if n = 10 and T_c = 7, each event that needs to be
reconstructed according to (4) must be reported by at least one of the
first 4 input peers. Hence, it is sufficient to compute the C_{i,j} and W_{i,j}
for the first n - T_c + 1 = 4 input peers.

3 When run on a 32-bit platform, up to twice the CPU load was observed,
with similar overall running time. This difference is due to
shares being stored in long variables, which are more efficiently
processed on 64-bit CPUs.


Abstract


Many applications of IP geolocation can benefit from ge- 
olocation that is robust to adversarial clients. These in- 
clude applications that limit access to online content to a 
specific geographic region and cloud computing, where 
some organizations must ensure their virtual machines 
stay in an appropriate geographic region. This paper 
studies the applicability of current IP geolocation tech- 
niques against an adversary who tries to subvert the tech- 
niques into returning a forged result. We propose and 
evaluate attacks on both delay-based IP geolocation tech- 
niques and more advanced topology-aware techniques. 
Against delay-based techniques, we find that the adver- 
sary has a clear trade-off between the accuracy and the 
detectability of an attack. In contrast, we observe that 
more sophisticated topology-aware techniques actually 
fare worse against an adversary because they give the 
adversary more inputs to manipulate through their use 
of topology and delay information. 


1 Introduction 


Many applications benefit from using IP geolocation to 
determine the geographic location of hosts on the In- 
ternet. For example, online advertisers and search en- 
gines tailor their content based on the client’s location. 
Currently, geolocation databases such as Quova [22] and 
MaxMind [16] are the most popular method used by ap- 
plications that need geolocation services. 

Geolocation is also used in many security-sensitive applications.
Online content providers such as Hulu [13],
BBC iPlayer [22], RealMedia [22] and Pandora [20]
limit their content distribution to specific geographic regions.
Before allowing a client to view the content, they
determine the client’s location from its IP address and allow
access only if the client is in a permitted jurisdiction.
In addition, Internet gambling websites must restrict access
to their applications based on the client’s location
or risk legal repercussions [29]. Accordingly, these businesses
rely on geolocation to limit access to their online
services.


Looking forward, the growth of infrastructure-as-a- 
service clouds, such as Amazon’s EC2 service [1], may 
also drive organizations using cloud computing to em- 
ploy geolocation. Users of cloud computing deploy VMs 
on a cloud provider’s infrastructure without having to 
maintain the hardware their VM is running on. However, 
differences in laws governing issues such as privacy, information
discovery, compliance and audit require
some cloud users to restrict VM locations to certain jurisdictions
or countries [6]. These location restrictions may
be specified as part of a service level agreement (SLA) 
between the cloud user and provider. Cloud users can 
use IP geolocation to independently verify that the loca- 
tion restrictions in their cloud SLAs are met. 


In these cases, the target of geolocation has an incen- 
tive to mislead the geolocation system about its true lo- 
cation. Clients commonly use proxies to mislead content 
providers so they can view content that is unauthorized 
in their geographic region. In response, some content
providers [13] have identified and blocked access
from known proxies, but this does not prevent all
clients from circumventing geographic controls. Similarly,
cloud providers may attempt to break location
restrictions in their SLAs to move customer VMs to 
cheaper locations. Governments that enforce location re- 
quirements on the cloud user may require the geoloca- 
tion checks to be robust no matter what a cloud provider 
may do to mislead them. Even if the cloud provider itself 
is not malicious, its employees may also try to relocate 
VMs to locations where they can be attacked by other 
malicious VMs [24]. Thus, while cloud users might trust 
the cloud service provider, they may still be required to
have independent verification of the location of their
VMs to meet audit requirements or to avoid legal liability.

IP geolocation has been an active field of research for
almost a decade. However, all current geolocation tech- 
niques assume a benign target that is not trying to in- 
tentionally mislead the user, and there has been limited 
work on geolocating malicious targets. Castelluccia et 
al. apply Constraint-Based Geolocation (CBG) [12] to 
the problem of geolocating fast-flux hidden servers that 
use a layer of proxies in a botnet [5] to conceal their loca- 
tion. Muir and Oorschot [18] describe limitations of pas- 
sive geolocation techniques (e.g., whois services) and 
present a technique for finding the IP address of a ma- 
chine using the Tor anonymization network [28]. These 
previous works focus on de-anonymization of hosts be- 
hind proxies, while our contribution in this paper is to 
answer fundamental questions about whether current ge- 
olocation algorithms are suitable for security-sensitive 
applications: 


• Are current geolocation algorithms accurate
enough to locate an IP within a certain country 
or jurisdiction? We answer this question by sur- 
veying previously published studies of geolocation 
algorithms. We find that current algorithms have 
accuracies of 35-194 km, making them suitable for 
geolocation within a country. 


• How can adversaries attack a geolocation system?
We propose attacks on two broad classes of
measurement-based geolocation algorithms — those 
relying on network delay measurements and those 
using network topology information. To evaluate 
the practicality of these attacks, we categorize ad- 
versaries into two classes — a simple adversary that 
can manipulate network delays and a sophisticated 
one with control over a set of routable IP addresses. 


• How effective are such attacks? Can they be
detected? We evaluate our attacks by analyzing 
them against models of geolocation algorithms. We 
also perform an empirical evaluation using mea- 
surements taken from PlanetLab [21] and execut- 
ing attacks on implementations of delay-based and 
topology-aware geolocation algorithms. We ob- 
serve the simple adversary has limited accuracy and 
must trade off accuracy for detectability of their at- 
tack. On the other hand, the sophisticated adversary 
has higher accuracy and remains difficult to detect. 


The rest of this paper is structured as follows. Sec- 
tion 2 summarizes relevant background and previous 
work on geolocation techniques. The security model and
assumptions we use to evaluate current geolocation proposals
are described in Section 3. We develop and analyze
attacks on delay-based and topology-aware geolocation
methods in Sections 4 and 5, respectively. Section
6 presents related work that evaluates geolocation
when confronted by a target that leverages proxies. We
present conclusions in Section 7.


2 Geolocation Background 


IP geolocation aims to solve the problem of determin- 
ing the geographic location of a given IP address. The 
solution can be expressed to varying degrees of granu- 
larity; for most applications the result should be precise 
enough to determine the city in which the IP is located, 
either returning a city name or the longitude and latitude 
where the target is located. The two main approaches to 
geolocation use either active network measurements to 
determine the location of the host or databases of IP to 
location mappings. 

Measurement-based geolocation algorithms [9, 12, 14, 
19, 30, 31] leverage a set of geographically distributed 
landmark hosts with known locations to locate the tar- 
get IP. These landmarks measure various network prop- 
erties, such as delay, and the paths taken by traffic be- 
tween themselves and the target. These results are used 
as input to the geolocation algorithm which uses them 
to determine the target’s location using methods such as: 
constraining the region where the target may be located 
(geolocalization) [12,30], iterative force directed algo- 
rithms [31], machine learning [9] and constrained opti- 
mization [14]. 

Geolocation algorithms mainly rely on ping [7] and 
traceroute [7] measurements. Ping measures the 
round-trip time (RTT) delay between two machines on 
the Internet, while traceroute discovers and mea- 
sures the RTT to routers along the path to a given des- 
tination. We classify measurement-based geolocation al- 
gorithms by the type of measurements they use to deter- 
mine the target’s location. We refer to algorithms that 
use end-to-end RTTs as delay-based [9, 12,31] and those 
that use both RTT and topology information as topology- 
aware algorithms [14, 30]. 
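
To illustrate how an end-to-end RTT constrains a target's location, the following sketch converts an RTT into an upper bound on the distance between a landmark and the target. It is not the calibrated delay-to-distance mapping of any algorithm cited here: the class name is hypothetical, and the propagation speed of roughly two thirds of the speed of light in a vacuum is a common rule of thumb for signals in fiber, used here as an assumption.

```java
// Illustrative sketch only; the propagation factor is an assumption.
final class RttDistanceBound {
    static final double SPEED_OF_LIGHT_KM_PER_MS = 299.792458;
    static final double FIBER_FACTOR = 2.0 / 3.0; // rule-of-thumb speed in fiber

    // Upper bound on the landmark-to-target distance implied by a round-trip
    // time: the signal covers the distance twice, so only half the RTT counts.
    static double maxDistanceKm(double rttMs) {
        if (rttMs < 0) {
            throw new IllegalArgumentException("RTT must be non-negative");
        }
        return (rttMs / 2.0) * SPEED_OF_LIGHT_KM_PER_MS * FIBER_FACTOR;
    }
}
```

Under these assumptions, an RTT of 30 ms places the target within about 3,000 km of the landmark, and a larger RTT always yields a larger bound.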

An alternative to measurement-based geolocation is 
geolocation using databases of IP to location mappings. 
These databases can be either proprietary or public. Pub- 
lic databases include those administered by regional In- 
ternet registries (e.g., ARIN [3], RIPE [23]). Propri- 
etary databases of IP to geographic location mappings 
are provided by companies such as Quova [22] and Max- 
mind [16]. While the exact method of constructing these 
databases is not public, they are sometimes based on a 
combination of whois services, DNS LOC records and 
autonomous system (AS) numbers [2]. Registries and 
databases tend to be coarse grained, usually returning the 
headquarters location of the organization that registered 
the IP address. This becomes a problem when organiza- 
tions distribute their IP addresses over a wide geographic 
region, such as large ISPs or content providers. Misleading
database geolocation is also straightforward through
the use of proxies.

Table 1: Average accuracy of measurement-based geolocation algorithms.

Class          | Algorithm          | Average accuracy (km)
Delay-based    | CBG [12]           | 78- 5
Delay-based    | GeoPing [19]       | 150 km (25th percentile); 109 km (median) [30]
Delay-based    | Statistical [31]   |
Delay-based    | Learning-based [9] | 407-449 (113 km less than CBG [12] on their data)
Topology-aware | TBG [14]           |
Topology-aware | Octant [30]        | 35-40 (median)
Topology-aware | GeoTrack [19]      | 156 km (median) [30]

DNS LOC [8] is an open standard that allows DNS ad- 
ministrators to augment DNS servers with location infor- 
mation, effectively creating a publicly available database 
of IP location information. However, it has not gained 
widespread usage. In addition, since the contents of the 
DNS LOC database are not authenticated and are set by 
the owners of the IP addresses themselves, it is poorly 
suited for security-sensitive applications. 

Much research has gone into improving the accuracy 
of measurement-based geolocation algorithms; conse- 
quently, they provide fairly reliable results. Table 1 
shows the reported average accuracies of recently pro- 
posed geolocation algorithms. Based on the reported ac- 
curacies, we believe that current geolocation algorithms 
are sufficiently accurate to place a machine within a 
country or jurisdiction. In particular, CBG [12] and Oc- 
tant [30] appear to offer accuracies well within the size 
of most countries and may even be able to place users 
within a metropolitan area. Measurement-based geoloca- 
tion is particularly appealing for secure geolocation be- 
cause if a measurement can reach the target (e.g., using 
application layer measurements [17]), even if it is behind 
a proxy (e.g., SOCKS or HTTP proxy), the effectiveness 
of proxying will be diminished. 


3 Security Model 


We model secure geolocation as a three-party problem. 
First, there is the geolocation user or victim. The user 
hopes to accurately determine the location of the target 
using a geolocation ween that relies on measure- 
ments of network properties!. We assume that; (1) the 
user has access to a number of landmark machines dis- 
tributed around the globe to make measurements of RTTs 
and network paths, and (2) the user trusts the results of 
measurements reported by landmarks. Second, there is 
the adversary, who owns the target’s IP address. The ad- 
versary would like to mislead the user into believing that 
the target is at a forged location of the adversary’s choos-
ing, when in reality the target is actually located at the
true location. The adversary is responsible for physically
connecting the target IP address to the Internet, which 
allows them to insert additional machines or routers be- 
tween the target and the Internet. The third party is the 
Internet itself. While the Internet is impartial to both ad- 
versary and user, it introduces additive noise as a result 
of queuing delays and circuitous routes. These properties 
introduce some inherent inaccuracy and unpredictability 
into the results of measurements on which geolocation 
algorithms rely. In general, an adversary’s malicious 
tampering with network properties (such as adding de- 
lay), if done in small amounts, is difficult to distinguish 
from additive noise introduced by the Internet. 


This work addresses two types of adversaries with dif- 
fering capabilities. We assume in both cases that the ad- 
versary is fully aware of the geolocation algorithm and 
knows both the IP addresses and locations of all land- 
marks used in the algorithm. The first, simple adver- 
sary can tamper only with the RTT measurements taken 
by the landmarks. This can be done by selectively de- 
laying packets from landmarks to make the RTT appear 
larger than it actually is. The simple adversary was cho- 
sen to resemble a home user running a program to selec- 
tively delay responses to measurements. The second, so- 
phisticated adversary, controls several IP addresses and 
can use them to create fake routers and paths to the tar- 
get. Further, this adversary may have a wide area net- 
work (WAN) with several gateway routers and can influ- 
ence BGP routes to the target. The sophisticated adver- 
sary was chosen to model a cloud provider as the adver- 
sary. Many large online service providers already deploy 
WANs [11], making this attack model feasible with low 
additional cost to the provider. 


We make two assumptions in this work. First, while 
aware of the geolocation algorithm being used, and the 
location and IP addresses of all landmarks, the adver- 
sary cannot compromise the landmarks or run code on 
them. Thus, the only way the adversary can compromise 
the integrity of network measurements is to modify the 
properties of traffic traveling on network links directly 
connected to a machine under its control.
The second assumption is that network measurements
made by landmarks actually reach the target. Otherwise, 
an adversary could trivially attack the geolocation system 
by placing a proxy at the forged location that responds to 
all geolocation traffic and forwards all other traffic to the 
true location. To avoid this attack, the user can either 
combine the measurements with regular traffic or protect 
it using cryptography. For example, if the geolocation 
user is a Web content provider, Muir and Oorschot [18] 
have shown that even an anonymization network such as 
Tor [28] may be defeated using a Java applet embedded 
in a Web page. Users who want to geolocate a VM in a
compute cloud may require the cloud provider to support 
tamper-proof VMs [10,25] and embed a secret key in 
the VM for authenticating end-to-end network measure- 
ments. In this case, the adversary would need to place a 
copy of the VM in the forged location to respond to mea- 
surements. Given that the adversary is trying to avoid 
placing a VM in the forged location, it is not a practical 
attack for a malicious cloud provider. 


4 Delay-based geolocation 


Delay-based geolocation algorithms use measurements 
of end-to-end network delays to geolocate the target IP. 
To execute delay-based geolocation, the landmarks need 
to calibrate the relationship between geographic distance 
and network delay. This is done by having each landmark,
$L_i$, ping all other landmarks. Since the landmarks
have known geographic locations, $L_i$ can then derive a
function mapping geographic distance, $g_{ij}$, to network
delay, $d_{ij}$, observed to each other landmark $L_j$, where
$i \neq j$ [12]. Each landmark performs this calibration and
develops its own mapping of geographic distance to net- 
work delay. After calibrating its distance-to-delay func- 
tion, it then pings the target IP. Using the distance-to- 
delay function, the landmark can then transform the ob- 
served delay to the target into a predicted distance to the 
target. All landmarks perform this computation to trian- 
gulate the location of the target. 

Delay-based geolocation operates under the implicit 
assumption that network delay is well correlated with ge- 
ographic distance. However, network delay is composed 
of queuing, processing, transmission and propagation de-
lay [15]. Only the propagation delay is related to the
distance traveled; the other components vary with network
load and thus add noise to the measured delay. The
assumption is also violated when network traffic does not
take a direct ("as the crow flies") path between hosts.
These indirect paths are referred to as "circuitous"
routes [30].

There are many proposed methods for delay-based ge-
olocation, including GeoPing [19], Statistical Geoloca-
tion [31], Learning-based Geolocation [9] and CBG [12].
These algorithms differ in how they express the distance-
to-delay function and how they triangulate the position of
the target. GeoPing is based on the observation that hosts
that are geographically close to each other will have de- 
lay properties similar to the landmark nodes [19]. Sta- 
tistical Geolocation develops a joint probability density 
function of distance to delay that is input into a force- 
directed algorithm used to geolocate the target [31]. In 
contrast, Learning-based Geolocation utilizes a Naive 
Bayes framework to geolocate a target IP given a set of 
measurements [9]. CBG has the highest reported accu- 
racy of the delay-based algorithms, with a mean error of 
78-182 km [12]. The remainder of this section therefore 
focuses on CBG to model and evaluate how an adversary 
can influence delay-based geolocation techniques. 

CBG [12] establishes the distance-to-delay function de-
scribed above by having the landmarks ping each other
to derive a set of points $(g_{ij}, d_{ij})$ mapping geographic
distance to network delay. To mitigate the effects of
congestion on network delays, multiple measurements
are made, and the 2.5th percentile of the network delays is
used by the landmarks to calibrate their distance-to-delay
mapping. Each landmark then computes a linear ("best
line") function that is closest to, but below, the set of
points. Distance between each landmark and the target 
IP is inferred using the “best line” function. This gives 
an implied circle around each landmark where the tar- 
get IP may be located. The target IP is then predicted to 
be in the region of intersection of the circles of all the 
landmarks. Since the result of this process is a feasible 
region where the target may be located, CBG determines 
the centroid of the region and returns this value as the 
geolocation result. Gueye et al. observe a mean error 
of 182 km in the US and 78 km in Europe. They also 
find that the feasible region where the target IP may be 
located ranges from $10^4$ km² in Europe to $10^5$ km² in
North America. 
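
To make the "best line" step concrete, the following Python sketch fits a line that lies on or below every calibration point while staying as close to the points as possible, then inverts it to turn a delay into a distance. It is an illustration under stated assumptions, not the CBG implementation: the function names, the brute-force search over pairs of points, and the requirement of a positive slope and non-negative intercept are all choices made here for brevity.

```python
from itertools import combinations


def fit_best_line(points):
    """Fit delay = slope * distance + intercept on or below all points.

    points: list of (distance_km, delay_ms) calibration pairs.
    The line minimising the total vertical gap to the points touches at
    least two of them, so trying every pair is enough for a sketch.
    """
    best = None
    for (g1, d1), (g2, d2) in combinations(points, 2):
        if g1 == g2:
            continue  # vertical line, not a function of distance
        slope = (d2 - d1) / (g2 - g1)
        intercept = d1 - slope * g1
        # Assumption: delay grows with distance and fixed delay is not negative.
        if slope <= 0 or intercept < 0:
            continue
        gaps = [d - (slope * g + intercept) for g, d in points]
        if min(gaps) < -1e-9:
            continue  # some point falls below the candidate line
        cost = sum(gaps)
        if best is None or cost < best[0]:
            best = (cost, slope, intercept)
    if best is None:
        raise ValueError("no feasible line through two calibration points")
    return best[1], best[2]


def delay_to_distance(delay_ms, slope, intercept):
    """Invert the fitted line to bound the distance implied by a delay."""
    return max(0.0, (delay_ms - intercept) / slope)
```

Each landmark would run this on its own calibration points; the distance returned for the delay measured to the target is the radius of that landmark's circle.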


4.1 Attack on delay-based geolocation 


Since delay-based geolocation techniques do not take 
network topology into account, the ability of a sophis- 
ticated adversary to manipulate network paths is of no 
additional value. Against a delay-based geolocation al- 
gorithm, the simple and sophisticated adversaries have 
equal power. 

To mislead delay-based geolocation, the adversary can
manipulate the distance to the target computed by the
landmarks by altering the delay observed by each landmark.
The adversary knows the identities and locations of each
landmark and can thus identify traffic from the landmarks
and alter the delay as necessary. To make the target at
the true location, $t$, appear to be at the forged location,
$\tau$, the adversary must alter the perceived delay, $d_{it}$,
between each landmark, $L_i$, and $t$ to become the delay,
$d_{i\tau}$, that each landmark should perceive between $L_i$
and $\tau$. To do this, two problems must be solved. The
adversary must first find the appropriate delay, $d_{i\tau}$,
for each landmark and then change the perceived delay to
the appropriate delay.

Figure 1: Landmarks (PlanetLab nodes) used in evaluation.

If the adversary controls a machine at or near $\tau$, she
may directly acquire the appropriate $d_{i\tau}$ for each
landmark by pinging each of the landmarks from the forged
location $\tau$. However, pings to all the landmarks from
a machine not related to the geolocation algorithm may
arouse suspicion. Also, it may not be the case that the
adversary controls a machine at or near $\tau$.

Alternatively, with knowledge of the location of the
landmarks, the adversary can compute the geographic
distances $g_{it}$ and $g_{i\tau}$ between each landmark $L_i$ and
the true location $t$ as well as the forged location $\tau$.
This enables the adversary to determine the additional
distance a probe from $L_i$ would travel
($\gamma_i = g_{i\tau} - g_{it}$) had it actually been directed
to the forged location $\tau$. The next challenge is to map
$\gamma_i$ into the appropriate amount of delay to add. To do
this, the adversary may use 2/3 the speed of light in a
vacuum ($c$) as a lower-bound approximation for the speed
of traffic on the Internet [14]. Thus, the required delay
to add to each ping from $L_i$ is:

$$\delta_i = \frac{2 \times \gamma_i}{2/3 \times c} \qquad (1)$$

The additional distance the ping from $L_i$ would travel is
multiplied by 2 because the delay measured by ping is
the round-trip time as opposed to the end-to-end delay.
This approximation is the lower bound on the delay that
would be required for the ping to traverse the distance
$2 \times \gamma_i$ because the speed of light propagation is the
fastest data can travel between the two points.

Armed with this approximation of the appropriate $d_{i\tau}$
for each landmark, the adversary can now increase the
delay of each probe from the landmarks. The perceived
delay cannot be decreased since this would require the
adversary to either increase the speed of the network path
between $t$ and $L_i$, or slow down probes from $L_i$ during
its calibration phase. Since the adversary cannot compromise
the landmarks and does not control network paths
that are not directly connected to one of her machines,
she is not able to accomplish this. As a result, the adversary
may only modify landmark delays that need to be
increased (i.e., $d_{i\tau} > d_{it}$). For all other landmarks, she
does not alter the delays. Thus, even with perfect knowledge
of the delays $d_{i\tau}$, neither a simple nor a sophisticated
adversary will be able to execute an attack perfectly on
delay-based geolocation techniques.

Figure 3: CDF of the distance the adversary tries to move
the target.
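
The delay computation described above can be sketched in a few lines of Python. The sketch assumes a spherical Earth with a mean radius of 6371 km and uses the haversine formula for the geographic distances; both are choices made for illustration, as are the function names. It applies Equation (1) and returns zero whenever the forged location is closer to the landmark than the true one, since delay cannot be removed.

```python
import math

C_KM_PER_S = 299_792.458                     # speed of light in a vacuum
INTERNET_SPEED = (2.0 / 3.0) * C_KM_PER_S    # 2/3 c, as in Equation (1)
EARTH_RADIUS_KM = 6371.0                     # assumed spherical Earth


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def added_delay_s(landmark, true_loc, forged_loc):
    """Seconds of delay to add to a ping from one landmark (Equation 1)."""
    gamma = (great_circle_km(landmark, forged_loc)
             - great_circle_km(landmark, true_loc))
    if gamma <= 0:
        return 0.0  # the delay would have to shrink, which is not possible
    return 2 * gamma / INTERNET_SPEED
```

For example, a forged location 1,000 km farther from a landmark than the true location calls for roughly 10 ms of added round-trip delay under this approximation.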


4.2 Evaluation 


We evaluate the effectiveness of our proposed attack
against a simulator that runs the CBG algorithm proposed
by Gueye et al. [12]. We collected measurement
inputs for the algorithm using 50 PlanetLab nodes. Each
node takes a turn being the target, with the remaining
49 PlanetLab nodes being used as landmarks. Figure 1
shows the locations of the PlanetLab nodes. Each target
is initially geolocated using observed network delays.
The target is then moved to 50 forged locations using the
delay-adding attack, shown in Figure 2. We select 40 of
the forged locations based on the location of US universities
and 10 based on the location of universities outside
of North America. This results in a total of 2,500 attempted
attacks on the CBG algorithm.

Figure 4: CDF of error distance for the adversary when
attacking delay-based geolocation using speed of light
(SOL) or best line delay.

In the delay-adding attack, the adversary cannot move
a target that is not within the same region as the landmarks
into that region. For example, if the target is located
in Europe, moving it to a forged location in North
America would require reducing delay to all landmarks,
which is not possible. This implies that if a geolocation
provider wants to prevent the adversary from moving the
target into a specific region, it should place its landmarks
in this desired region.

Figure 3 shows the CDF of the distances the adversary
attempts to move the target. In North America, the target
is moved less than 4,000 km most of the time and
less than 1,379 km 50% of the time. Outside of
North America, the distance moved consistently exceeds
5,000 km.

We evaluate the delay-adding attack under two cir- 
cumstances: (1) when the adversary knows exactly what 
delay to add (by giving the adversary access to the “best 
line” function used by the landmarks), and (2) when the 
adversary uses the speed of light (SOL) approximation 
for the additional delay. 


4.2.1 Attack effectiveness 


Since the adversary is only able to increase, and not decrease,
perceived delays, there are errors between the
forged location, $\tau$, and the actual location, $r$, returned
by the geolocation algorithm. To understand why these
errors exist, consider Figure 5. The arcs labeled $g_1$, $g_2$,
and $g_3$ are the circles drawn by 3 landmarks when geolocating
the target. The region enclosed by the arcs is
the feasible region, and the geolocation result is the centroid
of that region. To move $t$ to $\tau$, the adversary should
increase the radii of $g_2$ and $g_3$ and decrease the radius
of $g_1$. However, as described earlier, delay can only be
added, meaning that the adversary can only increase the
radii of $g_2$ and $g_3$ to $g_2'$ and $g_3'$, respectively (shown by the
dotted lines). Since the delay of $g_1$ cannot be decreased,
this results in a larger feasible region with a centroid $r$
that does not quite reach $\tau$. We call the difference between
the geolocation result ($r$) and the forged location ($\tau$)
the error distance ($\epsilon$) for the adversary. The difference
between the intended and actual direction of the move is
the angle $\theta$.

Figure 5: Attacking delay-based geolocation.
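
The geometry of Figure 5 can be reproduced with a small planar simulation. The Python sketch below is an illustration only: it works in arbitrary planar units rather than on the globe, approximates the feasible region by sampling a grid, and uses hypothetical function names. It shows the two effects discussed above: radii that would need to shrink are left unchanged, and the feasible region can only grow.

```python
import math


def feasible_region(circles, lo=-20.0, hi=20.0, step=0.25):
    """Grid approximation of the intersection of discs.

    circles: list of ((x, y), radius). Returns (centroid, area), or
    (None, 0.0) if no grid point lies inside every disc.
    """
    n = int(round((hi - lo) / step)) + 1
    sum_x = sum_y = 0.0
    count = 0
    for ix in range(n):
        x = lo + ix * step
        for iy in range(n):
            y = lo + iy * step
            if all(math.hypot(x - cx, y - cy) <= r for (cx, cy), r in circles):
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None, 0.0
    return (sum_x / count, sum_y / count), count * step * step


def attack_radii(landmarks, radii, t, tau):
    """Grow each radius by the extra distance to tau; never shrink it."""
    out = []
    for (lx, ly), r in zip(landmarks, radii):
        extra = (math.hypot(tau[0] - lx, tau[1] - ly)
                 - math.hypot(t[0] - lx, t[1] - ly))
        out.append(r + max(0.0, extra))
    return out


if __name__ == "__main__":
    landmarks = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
    t, tau = (5.0, 3.0), (2.0, 3.0)
    # Honest radii: true distance plus one unit of slack for network noise.
    radii = [math.hypot(t[0] - x, t[1] - y) + 1.0 for x, y in landmarks]
    forged = attack_radii(landmarks, radii, t, tau)
    print(feasible_region(list(zip(landmarks, radii))))
    print(feasible_region(list(zip(landmarks, forged))))
```

Comparing the two printed results gives the shift of the centroid toward the forged location and the growth of the region area that the detection discussion relies on.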

We begin by evaluating the error distance, $\epsilon$. Figure 4
shows the CDF of error for the adversary over the set of 
attempted attacks in our evaluation. Within North Amer- 
ica, an adversary using the speed of light approximation 
has a median error of 1,143 km. When the adversary has 
access to the best line function, her error decreases to
671 km. As a reference, 671 km is approximately half the
width of Texas. This indicates that when moving within 
North America, it is possible for an adversary with ac- 
cess to the best line function to be successful in trying 
to move the target into a specific state. We note that 
three of the targets used in our evaluation were located 
in Canada. Using the speed of light approximation these 
Canadian targets are able to appear in the US 65% of the 
time. Using the best line function, they are able to move 
into the US 89% of the time. 

Outside of North America, the delay-adding attack has
poor accuracy, with a minimum error for the adversary of
4,947 km. As a reference, the distance from San Francisco
to New York City is 4,135 km. Error of this magnitude
is not practical for an adversary attempting to place
the target in a specific country. For the remainder of this
section, we focus on attacks where the adversary tries
to move within North America because the error for the
adversary is more reasonable.

Figure 6: Error observed by the adversary depending on
distance of their attempted move for the delay-adding attack.


We next consider how the distance the adversary tries 
to move the target affects the observed error. Figure 6 
shows error for the adversary depending on how far the 
adversary attempts to move the target when using the 
speed of light approximation. Figure 7 shows the same 
data for an adversary with access to the best line func- 
tion. We note that the error observed by the adversary 
grows with the magnitude of the attempted move by the 
adversary. Specifically, for each 1 km the adversary tries
to move, the median error increases by 700 meters when
she does not have access to the best line function. With
access to the best line function, the median error per km
decreases by 43% to 400 meters. Thus, the attack we propose
works best when the distance between $t$ and $\tau$ is
relatively small, and the error observed by the attacker
grows linearly with the size of the move.


Given the relatively high errors observed by the adver- 
sary, we next verify whether the adversary moves in her 
chosen direction. Figure 8 shows the CDF of $\theta$, the dif-
ference between the direction the adversary tried to move
and the direction the target was actually moved. While 
lacking high accuracy when executing the delay-adding 
attack, the adversary is able to move the target in the gen- 
eral direction of her choosing. The difference in direction 
is less than 45 degrees 74% of the time and less than 90 
degrees 89% of the time. The attack where the adversary 
has access to the best line function performs better with 
a difference in direction of less than 45 degrees 91% of 
the time.

Figure 7: Error observed by the adversary depending on
distance of their attempted move for the delay-adding attack
when they have access to the best line function.


4.2.2 Attack detectability 


We next look at whether a geolocation provider can de- 
tect the delay-adding attack and thus determine that the 
geolocation result has been tampered with. 


When CBG geolocates a target, it determines a feasi-
ble region where the target can be located [12]. The size 
of the feasible region can be interpreted as a measure of 
confidence in the geolocation result. A very large region 
size indicates that there is a large area where the target 
may be located, although the algorithm returns the cen- 
troid. As we saw in Figure 5, the adversary, able only 
to add delay, can only increase the radii of the arcs and 
thus only increase the region size. As a result, the delay- 
adding attack always increases the feasible region size 
and reduces confidence in the result of the geolocation al- 
gorithm. We consider the region size computed by CBG 
before and after our proposed attack to determine how 
effective region size may be for detecting an attack. 


Figure 9 shows the region size for CBG when the 
delay-adding attack is executed in general, when the 
attack only attempts to move the target less than
1,000 km, and where the adversary has access to the best 
line function. We observe that the region size becomes 
orders of magnitude larger when the delay-adding attack 
is executed. The region size grows even larger when the 
adversary uses the best line function. An adversary that 
moves the target less than 1,000 km is able to execute 
the attack without having much impact on the region size 
distribution. 

The region size grows in proportion to the amount of
delay added. This explains why the adversary creates
a larger region size when using the best line function,
which adds more delay than the speed of light approximation.
Figure 10 illustrates this case. As the adversary
attempts to move the target further from its true location,
the amount of delay that must be added increases. This
in turn increases the region size returned by CBG. Thus,
while there may be methods for adding delay that improve
the adversary’s accuracy, they will only increase
the ability of the geolocation provider to detect the attack.

[...] adversary attempts to move the target using the best line function.


Given the increased region sizes observed when the 
delay-adding attack is executed, one defense would be to 
use a region size threshold to exclude geolocation results 
with insufficient confidence. Increased region sizes may 
be caused by an adversary adding delays, as we have ob-
served, or by fluctuations in the stochastic component of
network delay. In either case, the geolocation algorithm 
observes a region that is too large for practical purposes.
Suppose we discard all geolocation results with a region
size greater than 1,000,000 km² (this is approximately
the size of Texas and California combined). Figure 11
shows the CDF of region size below this threshold. The
adversary using the speed-of-light approximation will be
undetected only 36% of the time. However, if the adversary
attempts to move less than 1,000 km she will remain
undetected 74% of the time. An adversary with access
to the best line for each of the landmarks is more easily
detectable because of the larger region sizes that result
from the larger injected delays. With a threshold of
1,000,000 km², the adversary using the best line function
will have her results discarded 83% of the time. Thus,
using a threshold on the region size is effective for detecting
attacks on delay-based geolocation except when
the attacker tries to move the target only a short distance.

Figure 11: CDF of region size for CBG before and after
delay-adding, limited to points less than 1,000,000 km².
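
The threshold defense amounts to a simple filter over geolocation results. The Python sketch below shows one way to express it; the tuple layout and function name are hypothetical, and only the 1,000,000 km² threshold comes from the discussion above.

```python
REGION_THRESHOLD_KM2 = 1_000_000  # threshold discussed in the text


def filter_results(results, threshold_km2=REGION_THRESHOLD_KM2):
    """Split results into accepted and rejected by feasible-region size.

    results: iterable of (target_id, centroid, region_km2) tuples, a
    layout assumed here for illustration.
    """
    accepted, rejected = [], []
    for res in results:
        # Results whose region exceeds the threshold carry too little
        # confidence, whether due to an attack or to ordinary noise.
        (accepted if res[2] <= threshold_km2 else rejected).append(res)
    return accepted, rejected
```

A rejected result does not prove an attack; it only indicates that the result is too uncertain to rely on.
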


5 Topology-aware geolocation


Delay-based geolocation relies on correlating measured 
delays with distances between landmarks. As we saw 
previously, these correlations or mappings are applied 
to landmark-to-target delays to create overlapping con- 
fidence regions; the overlap is the feasible region, and 
the estimated location of the target is its centroid. When 
inter-landmark delays and landmark-to-target delays are 
not similarly correlated with physical distances (e.g., due 
to circuitous end-to-end paths) the resulting delay-to- 
distance relationships to the target can deviate signifi- 
cantly from the pre-computed correlations. 

Topology-aware geolocation addresses this problem 
by limiting the impact of circuitous end-to-end paths; 
specifically, it localizes all intermediate routers in ad- 
dition to the target node, which results in a better es- 
timate of delays. Starting from the landmarks, the ge- 
olocation algorithm iteratively estimates the location of 
all intermediate routers on the path between the land- 
mark and the target. This is done solely based on 
single-hop link delays, which are usually significantly 
less circuitous than multi-hop end-to-end paths, enabling 
topology-aware geolocation to be more resilient to cir- 
cuitous network paths than delay-based geolocation. 

There are two previously proposed topology-aware 
geolocation methods, topology-based geolocation 
(TBG) [14] and Octant [30]. These methods differ 
in how they geolocate the intermediate routers. TBG 
uses delays measured between intermediate routers 
as inputs to a constrained optimization that solves 
for the location of the intermediate routers and target 
IP [14]. In contrast, Octant leverages a “geolocalization” 
framework similar to CBG [12], where the location of 
the intermediate routers and target are constrained to 
specific regions based on their delays from landmarks 
and other intermediate routers [30]. These delays are 
mapped into distances using a convex hull, rather than a
linear function such as the best line in CBG, to improve
the mapping between distance and delay.

Octant leverages several optimizations that improve its 
performance over other geolocation algorithms. These
include taking into account both positive and negative
constraints, accounting for fixed delays along network
paths, and decreasing the weight of constraints based
on latency measurements. Wong et al. find that their
scheme outperforms CBG, with median accuracies of 35- 
40 km [30]. In addition, the feasible regions returned by 
Octant are much smaller than those returned by CBG. 
They also observe that their scheme is robust even given 
a small number of landmarks with performance leveling 
off after 15 landmarks. 

When analyzing and evaluating attacks on topology- 
aware geolocation, we consider a generic geolocation
framework. Intermediate routers are localized using con-
straints generated from latencies to adjacent routers. The 
target is localized to a feasibility region generated based 
on latencies from the last hop(s) before the target, and 
the centroid of the region is returned. 


5.1 Delay-based attacks on topology-aware 
geolocation 


Topology-aware geolocation systems localize all inter- 
mediate routers in addition to the target node. We begin 
by analyzing how a simple adversary, one without the 
ability to fabricate routers, could attack the geolocation 
system, and then move on to how a sophisticated adver-
sary could apply additional capabilities to improve the
attack. Since the simple adversary has no control over 
the probes outside her own network, any change made 
can only be reflected on the final links of the path to- 
wards the target. 

Most networks are connected to the rest of the
Internet via a small number of gateway routers. Any path 
connecting nodes outside the adversary’s network to the 
target (which is inside the network) will go through one 
of these routers. Here, we start with a simple case where 
all routes towards the target converge on a single gate- 
way router; we then consider the more general case of 
multiple gateway routers. 


CLAIM 1: If the network paths from the landmarks to
the target converge to a single common gateway router, 
increasing the end-to-end delays between the landmarks 
and the target can be detected and mitigated by topology- 
aware geolocation systems. 


To verify this claim, we first characterize the effect
of delay-based attacks on topology-aware geolocation.
Delay-based attacks selectively increase the delay of the
probes from landmarks. The probe from landmark $L_i$
is delayed for an additional $\delta_i$ seconds. Given that all
network paths to the target converge to a single common
gateway router $h$, the end-to-end delay from each landmark,
$L_i$, to the target can be written as:

$$d_{it} = d_{ih} + d_{ht} + \delta_i \qquad (2)$$

The observed latency from the gateway to the target is
$d_{it} - d_{ih}$, which is the sum of the real last-hop latency
and the attack delay. However, since the delay-based attack
relies on selectively varying the attack delays, $\delta_i$,
based on the location of $L_i$, the observed last-hop latency
between the gateway and the target will be inconsistent
across measurements initiated from different landmarks.

The high variance in the last-hop link delay can be
used to detect delay-based attacks in topology-aware geolocation
systems. The attack can be mitigated by taking
the minimum observed delay for each link. The resulting
observed link delay from $h$ to the target is:

$$\hat{d}_{ht} = d_{ht} + \min_i \delta_i \qquad (3)$$

This significantly reduces the scope of delay-based attacks,
requiring attack delays to be uniform across all
measurement vantage points when there is only a single
common gateway to the target.
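
The detection and mitigation steps can be sketched as follows in Python. The sketch assumes that each landmark reports its end-to-end delay to the target and its delay to the common gateway in the same units; the function names and the variance threshold are hypothetical and would need to be tuned against normal jitter.

```python
from statistics import pvariance


def last_hop_delays(end_to_end, to_gateway):
    """Per-landmark last-hop estimate: d_it - d_ih."""
    return {i: end_to_end[i] - to_gateway[i] for i in end_to_end}


def analyse_last_hop(end_to_end, to_gateway, variance_threshold):
    """Return (suspicious, mitigated_last_hop_delay).

    suspicious is True when the last-hop estimates disagree across
    landmarks by more than variance_threshold (an assumed tuning knob).
    The mitigated delay is the minimum estimate, following Equation (3).
    """
    values = list(last_hop_delays(end_to_end, to_gateway).values())
    suspicious = pvariance(values) > variance_threshold
    return suspicious, min(values)
```

With honest measurements the last-hop estimates agree and the variance stays near zero; selectively added delays make the estimates diverge while the minimum still bounds the true last-hop delay plus the smallest added delay.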

In general, if there are multiple gateway routers on the 
border of the adversary’s network, we can make the fol- 
lowing weaker claim: 


CLAIM 2: Increasing the delay between each gate-
way and the target can only be as effective against 
topology-based geolocation as increasing end-to-end de- 
lays against delay-based geolocation with a reduced set 
of landmarks. 


An adversary could attempt to modify delays between
each gateway router, $h_i$, and the target, $t$. This assumes
the adversary knows the approximate geolocation results
for all gateway routers. Where there is only a single
gateway router with no additional attack delay, topology-
based geolocation places the target within a circle centered
at $h$ with coordinates $(\lambda_h, \phi_h)$:

$$\sqrt{(x - \lambda_h)^2 + (y - \phi_h)^2} = d_{ht} \qquad (4)$$

Subjecting the latency measurement to an additional delay,
$\delta$, changes the equation to the following:

$$\sqrt{(x - \lambda_h)^2 + (y - \phi_h)^2} = d_{ht} + \delta \qquad (5)$$

Thus, for targets with a single gateway router, an adversary
can only increase the localization region by introducing
an additional delay without changing the location
of the region’s geometric center.

For targets with multiple gateway routers
$H = \{h_0, h_1, \ldots, h_n\}$, targets are geolocated based on the
delays between the gateways and $t$. An adversary can
add additional delay, $\delta_i$, between each gateway, $h_i$, and
$t$ based on the location of $h_i$. This is equivalent to
the delay-adding attack, except the previously geolocated
gateway routers are used in place of the real landmarks.
Therefore, the previous evaluation results for the
delay-adding attack on delay-based geolocation can be
extended to topology-based geolocation for targets with
multiple gateway routers.


5.2 Topology-based attacks 


In topology-based geolocation, intermediate nodes are 
localized to confidence regions, and geographic con- 
straints constructed from these intermediate nodes are 
expanded by their confidence regions to account for the
accumulation of error. However, this does not result in
a monotonic increase in the region size of intermediate 
nodes with each hop. The intersection of several ex- 
panded constraints for intermediate nodes along multiple 
network paths to the target can still result in intermedi- 
ate nodes that are localized to small regions. A sophisti- 
cated adversary with control over a large administrative 
domain can exploit this property by fabricating nodes, 
links and latencies within its network to create constraint 
intersections at specific locations. This assumes that the 
adversary can detect probe traffic issued from geoloca- 
tion systems in order to present a topologically different 
network without affecting normal traffic. 

Externally visible nodes in an adversary’s network
consist of gateway routers $ER = \{er_0, er_1, \ldots, er_m\}$,
internal routers $F = \{f_0, f_1, \ldots, f_n\}$ and end-points
$T = \{\tau_0, \tau_1, \ldots, \tau_s\}$. Internal routers can be fictitious,
and network links between internal routers can be arbitrarily
manufactured. The adversary’s network can be described
as the graph $G = (V, E)$, where $V = F \cup ER \cup T$
represents routers, and $E = \{e_0, e_1, \ldots, e_k\}$, with weights
$w(e_i)$, is the set of links connecting the routers, with
weights representing network delays.

All internal link latencies, including those between
gateways, can be fabricated by the adversary. However,
the delay between fictitious nodes must respect the
speed-of-light constraint, which dictates that a packet can
travel at most a distance equal to the product of the delay
and the speed of light in fiber.
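
The speed-of-light constraint can be checked mechanically for any claimed link. The Python sketch below is illustrative: it assumes the speed of light in fiber is 2/3 of the speed of light in a vacuum (the same approximation used for Equation (1)), takes one-way delays in milliseconds, and uses hypothetical function names.

```python
# Assumed propagation speed in fiber: 2/3 c, expressed in km per millisecond.
C_FIBER_KM_PER_MS = (2.0 / 3.0) * 299_792.458 / 1000.0


def link_is_plausible(distance_km, one_way_delay_ms):
    """True if a packet could cover distance_km within one_way_delay_ms."""
    return distance_km <= one_way_delay_ms * C_FIBER_KM_PER_MS


def implausible_links(links):
    """Names of links whose claimed endpoints are too far apart.

    links: iterable of (name, claimed_distance_km, one_way_delay_ms).
    """
    return [name for name, dist, delay in links
            if not link_is_plausible(dist, delay)]
```

A fabricated topology that places two routers 1,000 km apart must therefore report at least about 5 ms of one-way delay between them to remain consistent.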


CLAIM 3: Topology-based attacks require the adversary
to have more than one geographically distributed gate- 
way router to its network. 


This claim follows from the analysis of delay-based at- 
tacks when all network paths to the target converge to a 
common gateway router. With only one gateway router 
to the network, changes to internal network nodes can af- 
fect only the final size of the localization region, not the 
region’s geometric center. 


CLAIM 4: An adversary with control over three or more
geographically distributed gateway routers to its network 
can move the target to an arbitrary location. 


Unlike delay-based attacks that can only increase laten- 
cies from the landmarks to the target, topology-based 
attacks can assign arbitrary latencies from the ingress 
points to the target. From geometric triangulation, this 
enables topology-based attacks to, theoretically, triangu- 
late the location of the target to any point on the globe 
given three or more ingress points. 

In practice, there are challenges that limit the adver- 
sary from achieving perfect accuracy with this attack. 
Specifically, the attack requires the adversary to know the 
estimated location of the gateway routers and to have an
accurate model of the delay-to-distance function used by 
the geolocation system. Such information can be reverse- 
engineered by a determined adversary by analyzing the 
geolocation results of other targets in the adversary’s net- 
work. 

Although a resourceful adversary’s topology-based at- 
tack can substantially affect geolocation results, it can 
also introduce additional circuitousness to all network 
paths to the target that creates a detectable signature. Cir- 
cuitousness refers to the ratio of actual distance traveled 
along a network path to the direct distance between the 
two end points of a path. Circuitousness can be observed 
by plotting the location of intermediate nodes as they are 
located by the topology-aware geolocation system. 


5.2.1 Naming attack extension 


State-of-the-art, topology-based geolocation  sys- 
tems [14,30] leverage the structured way in which most 
routers are named to extract more precise information 
about router location. A collection of common naming 
patterns is available through the undns tool [27], which 
can extract approximate city locations from the domain 
names of routers. 

When geolocation relies on undns, an adversary can 
effectively change the observed location of the target 
even with only a single gateway router to its network. 
This naming attack requires that the adversary be capable of
crafting a domain name that can deceive the undns tool,
poisoning the undns database with erroneous mappings,
or responding to traceroutes with a spoofed IP address.
The adversary only needs to use the naming attack to 
place any last hops before the target at its desired geo- 
graphic location. The target will then be localized to the 
same location as this last hop in the absence of sufficient 
constraints. 

Naming attacks exhibit the same increased circuitous- 
ness as standard topology-based attacks. Extensive poi- 
soning of the undns database could allow an attacker to 
change the location of other routers along the network 
paths to reduce path circuitousness. 


5.3 Evaluation


We evaluate the topology-based (hop-adding) attack and 
undns naming extension using a simulator of topology- 
aware geolocation. To perform the evaluation, we de- 
veloped the fictitious network illustrated in Figure 12. 
The network includes 4 gateway routers (ER), repre- 
sented by PlanetLab nodes in Victoria, BC; Riverside, 
CA; Ithaca, NY, and Gainesville, FL. The network also 
includes 11 forged locations (T) and 14 non-existent internal
routers (F). Three of the non-existent routers are
Figure 12: The adversary’s network used for evaluating
the topology-based attack. 


geographically distributed around the US, while the other 
11 are placed close to the forged locations to improve 
the effectiveness of the attack, especially when the ad- 
versary can manipulate undns entries. Routers in the fic- 
titious network are connected using basic heuristics. For 
example, each of the 11 internal routers near the forged 
locations is connected to the 3 routers nearest them to 
aid in triangulation. We show that even using this simple 
network design, an adversary executing the hop-adding 
attack and undns extension can be successful. 


To evaluate the attack, we use the same set of 50 Plan- 
etLab nodes used in evaluating the delay-adding attack 
(Figure 1), with an additional 30 European PlanetLab 
nodes that act only as targets attempting to move into 
North America. We move the targets to the 11 forged lo- 
cations in the fictitious network. These locations, a sub- 
set of the 40 US locations used in evaluating the delay- 
adding attack, were chosen to be geographically dis- 
tributed around the US. Each of the 80 PlanetLab nodes 
takes a turn being the target with the remaining US Plan- 
etLab nodes used as landmarks. Each target is moved to 
each of the 11 forged locations in turn, for a total of 880 
attacks. 


When executing the attack, the traceroute from each 
landmark is directed to its nearest gateway router. The 
first part of the traceroute is dictated by the network 
path between the landmark and its nearest gateway router 
(represented by a PlanetLab node). The second part is 
artificially generated to be the shortest path between the 
gateway router and the forged location. The latency of 
the second part is lower bounded by the speed-of-light 
delay between the gateway router and the target’s true 
location. When the speed-of-light latency between the 
gateway router and the target is greater than the latency 
on the shortest path from the gateway to the forged lo- 
cation, the additional delay is divided across links in the 
shortest path. 
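
The construction of the fabricated second part of a traceroute can be sketched as follows. This is a minimal sketch under stated assumptions: the fictitious network is a dictionary of per-link delays in milliseconds, the function names are hypothetical, and the additional delay is split evenly across links (the text says only that it is divided across them).

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra over graph = {node: {neighbor: delay_ms}}.
    Returns (path, total_delay_ms)."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError("no path from gateway to forged location")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[dst]


def fabricated_link_delays(graph, gateway, forged, true_min_delay_ms):
    """Per-link delays for the fabricated part of the traceroute.

    true_min_delay_ms is the speed-of-light delay between the gateway
    and the target's true location. If the shortest path is faster
    than that, the shortfall is spread evenly over its links
    (even spreading is an assumption of this sketch).
    """
    path, total = shortest_path(graph, gateway, forged)
    links = list(zip(path, path[1:]))
    shortfall = max(0.0, true_min_delay_ms - total)
    pad = shortfall / len(links) if links else 0.0
    return path, [graph[u][v] + pad for u, v in links]
```

When the shortest path is already slower than the speed-of-light bound, the link delays are returned unchanged.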
Figure 14: Error observed by the adversary depending
on how far they attempt to move the target using the 
topology-based attack. 
Figure 13: CDF of error distance for the attacker when
executing the topology-based and undns attacks. 


5.3.1 Attack effectiveness 


We begin by examining how accurate the adversary can 
be when attempting to move the target to a specific 
forged location. Figure 13 shows the error for the ad- 
versary when executing the topology-based attack and 
undns extension. Without the undns extension, the adversary
is able to place a North American target within
680 km of the false location 50% of the time. This is sim- 
ilar to the delay-adding attack in which the adversary has 
access to the best line function. When moving a target 
from Europe to North America, the adversary’s median 
error increases by 50% to 929 km. Despite this increase, 
we observe that the adversary succeeds in each attempt 
to move a European target into the US. In addition to 
the overall decrease in accuracy for the adversary, we 
note that there are some instances where the target in Eu- 


19th USENIX Security Symposium 


[Figure 15 plot. Series: 10th percentile, median, 90th percentile. X-axis: distance of attempted move (km).]


Figure 15: Error observed by the adversary depending on 
how far they attempt to move the target using the undns 
attack. 


rope misleads the algorithm with higher accuracy. This 
is caused by the adversary using the speed-of-light ap- 
proximation for latencies within their network. Since the 
speed-of-light is the lower bound on network delay, when 
additional delay is added to the links to account for the 
time it would take a probe to reach the target in Europe, 
the delay approaches the larger delay expected by the 
landmarks’ distance-to-delay mapping. The undns ex- 
tension increases the adversary’s accuracy by 93%, with 
the adversary locating herself within 50 km of the forged 
location 50% of the time. These results are consistent 
whether the true location of the target is in North Amer- 
ica or Europe. 


When analyzing the delay-adding attack, we observed 
a linear relationship between the distance the adversary 
attempts to move the target and the error she observes. 
Figures 14 and 15 show the 10th percentile, median and 
90th percentile error for the attacker depending on how 
far the forged location is from the target for the topology- 
based attack and undns extension, respectively. The ob- 
served errors were quite erratic which is a result of the 
many other factors that affect the accuracy of geolocation 
beyond the distance of the attempted move. In general, 
error for the adversary increases slowly as the adversary 
tries to move the target longer distances. This enables an 
adversary executing the topology-based attack to move 
the target longer distances. Error for the adversary using 
the undns extension remains fairly constant regardless of 
how far they attempt to move the target. In the case of the 
undns attack, the median accuracy fluctuates by less than 
60 km whether the adversary moves 500 km or 4,000 km. 
The slow growth of adversary error stems from the en- 
gineered delays in the fictitious network. These delays 
cause nodes along the paths (including the end point) to 
Figure 16: CDF of change in direction for the topology-based
and undns extension.


be geolocated to a similar location regardless of where 
the target was originally located.

We next confirm that the adversary is able to move in 
her chosen direction. Figure 16 shows the difference be- 
tween the direction the adversary tried to move the target 
and the direction the target was actually moved (θ in the
delay-adding attack). For the general topology-based at- 
tack, the adversary is within 36 degrees of her intended 
direction 75% of the time and within 69 degrees 90% of 
the time. This improves with the undns extension, where
the adversary is within 3 degrees of her intended direction
95% of the time. When the target attempts to move
from Europe to North America, it always moves very
close to its chosen direction: the adversary is always
within 10 degrees of her chosen direction. The smaller
change in direction for European nodes stems from the 
longer distance between the target and the forged loca- 
tion. This causes a smaller change in direction to be ob- 
served for similar error values compared to a target that 
is closer to the forged location. 
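
The change in direction can be computed from three coordinates: the target's true location, the forged location, and the location returned by geolocation. The following is a minimal sketch, assuming locations are (latitude, longitude) pairs in degrees and that direction is measured as the initial great-circle bearing from the true location; the function names are illustrative and are not taken from the evaluation code.

```python
import math


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2,
    in degrees clockwise from north, in the range [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def direction_change_deg(true_loc, forged_loc, result_loc):
    """Angle between the intended direction (true -> forged) and the
    direction actually achieved (true -> result), in [0, 180]."""
    intended = initial_bearing_deg(*true_loc, *forged_loc)
    actual = initial_bearing_deg(*true_loc, *result_loc)
    diff = abs(intended - actual) % 360.0
    return min(diff, 360.0 - diff)
```

Because the angle is measured at the true location, the same absolute error at the forged location subtends a smaller angle when the forged location is farther away, which matches the behavior described for European targets.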


5.3.2 Attack detectability 


We have observed that an adversary executing the 
topology-based attack and the undns extension to the at- 
tack can accurately relocate the geolocation target. We 
next consider whether the victim would be able to detect 
these attacks and reduce their impacts on geolocation re- 
sults. 

Figure 17 shows the region sizes for topology-aware 
geolocation and undns geolocation before and after the 
attacks are executed (for both North American and European
targets). Unlike the delay-adding attack, the adversary
that adds hops to the traceroutes of the victim
has region sizes similar to the original algorithms and, 
in some cases, even smaller region sizes. For topology- 
Figure 17: CDF of region size before and after the
topology-based attack and undns extension. 


aware geolocation, we observe median region sizes of 
102,273 km² before and 50,441 km² after the attack. For
the undns extension, we observe median region sizes of
4,448 km² before and 790 km² after the attack. These results
indicate that region size is a poor metric for ruling
out attacks that add hops to the end of traceroute paths. 


Another metric that may be used to rule out geoloca- 
tion results that have been modified by an adversary is 
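path circuitousness. We define the circuitousness of a traceroute
path between landmark $L_i$ and the target as follows, where
$r = (\lambda_r, \phi_r)$ is the location returned by the
geolocation algorithm, $h_j = (\lambda_j, \phi_j)$ is the location
of intermediate hop $j$ as computed by the geolocation algorithm,
$h_n$ is the last intermediate hop, and $d(a, b)$ is the distance
between locations $a$ and $b$:


$$C = \frac{d(L_i, h_0) + \sum_{j=0}^{n-1} d(h_j, h_{j+1}) + d(h_n, r)}{d(L_i, r)} \qquad (6)$$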


Figure 18 shows the distribution of circuitousness for 
paths between each landmark and the target for topology- 
aware geolocation before and after the topology-based 
attack is executed³. We observe that when the topology-based
attack is executed, the circuitousness per landmark
increases. One criterion a geolocation algorithm can 
use for discarding results from the topology-based at- 
tack would be to discard results from landmarks where 
the circuitousness is abnormally high. If a geolocation 
framework that assigns weights to constraints, such as 
Octant, is used, constraints from landmarks with high 
circuitousness could be given a lower weight to limit the 
adversary’s effectiveness. We note that a clever adver- 
sary could design her network to use more direct paths, 
making it more difficult to detect the attack by observing 
circuitousness. 
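
The circuitousness check can be sketched as follows. This is a minimal sketch, assuming circuitousness is the ratio of the distance travelled through the geolocated hops to the direct distance between the landmark and the geolocation result, that distances are great-circle distances on a spherical Earth, and that the cut-off threshold is chosen by the operator; the function names and the threshold are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points in degrees."""
    (lat1, lon1), (lat2, lon2) = a, b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def circuitousness(landmark, hops, result):
    """Distance travelled via the geolocated hops divided by the
    direct landmark-to-result distance (1.0 means a direct path)."""
    points = [landmark, *hops, result]
    travelled = sum(great_circle_km(p, q) for p, q in zip(points, points[1:]))
    direct = great_circle_km(landmark, result)
    if direct == 0:
        raise ValueError("landmark and result coincide")
    return travelled / direct


def trusted_landmarks(paths, threshold):
    """Keep landmarks whose path circuitousness is at most threshold.
    paths maps a landmark name to (landmark, hops, result)."""
    return [name for name, (lm, hops, res) in paths.items()
            if circuitousness(lm, hops, res) <= threshold]
```

A weighting framework could use the same ratio to down-weight, rather than discard, constraints from landmarks with unusually circuitous paths.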
Figure 18: CDF of circuitousness for each landmark before
and after the topology-based attack.


6 Related work 


While there have been many related works on developing 
and evaluating geolocation algorithms (e.g., [12, 14, 26, 
30]), there has been limited study of IP geolocation given 
a non-benign target [5, 18]. 

Castelluccia et al. consider the application of 
CBG [12] to the problem of geolocating hidden servers 
hosting illegal content within a botnet [5]. The technique 
used to hide these servers is referred to as “fast-flux”,
where a constantly changing set of machines infected by 
a botnet is used to proxy HTTP messages for a hidden 
server. Geolocating these servers is important to enable 
the appropriate authorities to take action against them. 
Castelluccia et al. leverage the fact that the hidden server 
is behind a layer of proxies to factor out the portion of 
the observed RTT caused by the proxy layer. They use 
HTTP connections to measure RTTs (because the hidden 
servers are unlikely to respond to ping) and factor out
additional delay caused by the layer of proxies to geolo- 
cate hidden servers with a median error of 100 km using 
PlanetLab nodes as ground truth hidden servers. 

Muir and Oorschot survey a variety of geolocation 
techniques and their applicability in the presence of an 
adversarial target [18]. Their work is similar to but dis- 
tinct from ours. Specifically, they emphasize geolocation 
techniques that leverage secondary sources of informa- 
tion, such as whois registries based on domain, IP and 
AS; DNS LOC [8]; application data from HTTP head- 
ers, and data inferred from routing information. They 
consider delay-based geolocation but do not specify or 
evaluate any attacks on measurement-based geolocation. 
Muir and Oorschot discuss the limitations of IP geolocation
when an adversary attempts to conceal her IP address
through the use of an anonymization proxy and examine
how a Web page embedding a Java applet can discover
a client’s true identity using Java’s socket class to
connect back to the server. They demonstrate this strat- 
egy for identifying clients using the Tor [28] anonymiza- 
tion network. 

These previous works begin to consider the perfor- 
mance of geolocation algorithms when the target of ge- 
olocation may have incentive to be adversarial. However, 
they generally focus on the issue of geolocating hosts that 
attempt to deceive geolocation using proxies. In con- 
trast, we develop and evaluate attacks on two classes of 
measurement-based geolocation techniques by manipu- 
lating the network properties on which the techniques 
rely. 

We observe that the problem of geolocating an adver- 
sarial target is similar to the problem of secure position- 
ing [4] in the domain of wireless networks. Unlike wire- 
less signals, network delay is subject to additive noise 
as a result of congestion and queuing along the network 
path as well as circuitous routes. Multiple hops along 
network paths on the Internet and the existence of large 
organizational WANs also enable new adversarial mod- 
els in the domain of IP geolocation. 


7 Conclusions 


Many applications of geolocation benefit from security 
guarantees when confronted with an adversarial target. 
These include popular applications, such as limiting me- 
dia distribution to a specific region, fraud detection, and 
newer applications, such as ensuring regional regulatory 
compliance when using an infrastructure as a service 
provider. This paper considered two models of an adver- 
sary trying to mislead measurement-based geolocation 
techniques that leverage end-to-end delays and topology 
information. To this end, we developed and evaluated 
two attacks against delay-based and topology-aware ge- 
olocation. 

To avoid detection, adversaries can leverage inherent 
variability in network delay and circuitousness of net- 
work paths on the Internet to hide their tampering. Since 
these properties are measured and used by various geolo- 
cation techniques, they serve as good attack vectors by 
which the adversary can influence the geolocation result. 

Our most surprising finding is that the more advanced 
and accurate topology-aware geolocation techniques are 
more susceptible to covert tampering than the simpler 
delay-based techniques. For geolocation algorithms that 
leverage delay, we observed how a simple adversary that 
only adds delay to probes could alter the results of ge- 
olocation. However, this adversary has limited precision 
when attempting to forge a specific location. We also 
observed a clear trade-off between the amount of delay 
an adversary added and her detectability, using the region
size returned by CBG [12] as a metric for discarding
anomalous results. 

Compared to delay-based geolocation, topology- 
aware geolocation fares no better against a simple adver- 
sary and worse against a sophisticated one. Topology- 
aware geolocation uses more information sources, such 
as traceroute and undns, to achieve higher accuracy than
delay-based geolocation. Unfortunately, this advantage 
becomes a weakness against an adversary able to corrupt 
these sources. A sophisticated adversary that can lever- 
age multiple network entry points (e.g., an infrastructure 
as a service provider) can cause the geolocation system to 
return a result as accurate as the best case simple adver- 
sary without increasing the resultant region size. When 
undns entries are corrupted, the adversary is able to forge 
locations with high accuracy without increasing the re- 
gion sizes — in some cases, even decreasing them. 

Our work reveals limitations of current measurement- 
based geolocation techniques given an adversarial target. 
To provide secure geolocation, these algorithms must ac- 
count for the presence of untrustworthy measurements. 
This may be in the form of heuristics to discount mea- 
surements deemed untrustworthy or through the use of 
secure measurement protocols. We intend to explore 
these directions in future work. 
Notes


¹In reality, the consumer of geolocation information will likely
contract out geolocation services from a third party geolocation provider
that will maintain landmarks. Given the common goals of these two
entities we model them as a single party.

²The adversary can assume that the gateway routers are geolocated
to their true locations.

³We make similar observations for the undns attack extension.
Abstract


Idle port scanning uses side-channel attacks to bounce 
scans off of a “zombie” host to stealthily scan a vic- 
tim IP address or infer IP-based trust relationships be- 
tween the zombie and victim. We present results from 
building a transition system model of a network proto- 
col stack for an attacker, victim, and zombie, and testing 
this model for non-interference properties using model 
checking. Two new methods of idle scans resulted from 
our modeling effort, based on TCP RST rate limiting and 
SYN caches, respectively. Through experimental verifi- 
cation of these attacks, we show that it is possible to scan 
victims which the attacker is not able to route packets to, 
meaning that protected networks or ports closed by fire- 
wall rules can be scanned. This is not possible with the 
one currently known method of idle scan in the literature 
that is based on non-random IPIDs. 

For the future design of network protocols, a notion 
of trusted vs. untrusted networks and hosts (based on 
existing IP-based trust relationships) will enable shared, 
limited resources to be divided. For a model complex 
enough to capture the details of each attack and where 
a distinction between trusted and untrusted hosts can be 
made, we modeled RST rate limitations and a split SYN 
cache structure. Non-interference for these two resources 
was verified with symbolic model checking and bounded 
model checking to depth 1000, respectively. Because 
each transition is roughly a packet, this demonstrates that 
the two respective idle scans are ameliorated by separat- 
ing these resources. 


1 Introduction 


Network reconnaissance is the important first step of vir- 
tually all network attacks. By scanning the network, the 
attacker is able to gain valuable information about the 
hosts that exist and the services they offer, infer IP-based 
trust relationships between hosts that are enforced by 
firewall rules and router tables, and collect other information
that they can use in the next stage of attack. In
this paper, we show that model checking can be a useful 
framework for predicting and mitigating attacker capa- 
bilities. In idle scans, an attacker scans a victim with- 
out sending packets to that victim using its own return 
IP address. The model of idle scans that we describe in 
this paper led to the discovery of two new forms of idle 
scan¹. One of these, based on SYN cache structures that
are common to all modern network stacks, gives an at- 
tacker capabilities beyond the one currently known form 
of idle scan in the literature. We demonstrate that it is 
possible to infer the liveness of hosts and some informa- 
tion about what operating system they are running on a 
subnetwork without the ability to route packets to that 
network. We also demonstrate that it is possible to port 
scan a network on a port that the firewall protecting the 
network blocks. This means that if, e.g., a particular port 
is blocked by a firewall for an entire subnetwork, an at- 
tacker can scan the hosts on that subnetwork on that port 
from outside the firewall using idle scans. Finally, we 
demonstrate that if a distinction between trusted and untrusted
hosts were made explicit in the lower layers of
the network protocol stack, then separate RST rate limitations
and a split SYN cache structure would eliminate these
attacks in our model of network stacks, which is complex 
enough to model all of the details of each attack. 


The two new forms of idle scan that have resulted 
from the model checking effort presented in this paper 
are based on RST rate limiting and SYN caches, re- 
spectively. These were discovered during the process of 
building the model and manifest as counterexamples to a 
non-interference property that are produced by the model 
checker. In the RST rate limiting counterexample, the
zombie is a FreeBSD machine that limits the
number of RST packets that it will send in a given time
period. The attacker can infer the port status of the victim
by testing the rate at which the zombie will reply with
RST packets; the details are in Section 4.
The SYN cache counterexample is different from
the existing IPID-based idle scan, which is described 
in Section 2, in that the attacker never sends pack- 
ets to the victim, not even forged packets. Instead the 
attacker forges SYN packets from the victim to the zom- 
bie, and the zombie sends a SYN/ACK to the victim and 
places these SYN packets in its SYN cache (a data struc- 
ture for holding half-open TCP connections for which a 
SYN/ACK has been sent but an ACK response has yet 
to arrive). Because RSTs and ICMP errors from the vic- 
tim will cause this SYN cache entry to be removed, the 
attacker can effectively perform a SYN/ACK scan of the 
victim without needing the ability to route packets to the 
victim. The attacker does this by testing the state of the 
SYN cache by sending SYNs with its own return IP address
and viewing the SYN/ACK responses. The victim’s replies
to the probes can be inferred from the attacker’s
ability to get SYN cache entries for its own SYNs. This
makes possible testing for the liveness of IP addresses on 
protected networks with a rudimentary form of OS detec- 
tion, and even port scanning on certain types of hosts on 
a port that is entirely blocked by a firewall. More details 
are given in Section 4. 
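
The side channel can be illustrated with a toy simulation. This is a minimal sketch and not the model or the experimental setup used in this paper: it assumes a SYN cache of fixed capacity that refuses new entries when full, a victim that either answers every unsolicited SYN/ACK with a RST or stays silent, and hypothetical class and function names.

```python
class SynCache:
    """Toy SYN cache: a bounded set of half-open connections,
    keyed by (source address, source port)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = set()

    def on_syn(self, src, port):
        """Try to add a half-open connection.
        Returns True if an entry was created (a SYN/ACK is sent)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.add((src, port))
        return True

    def on_rst(self, src, port):
        """A RST from the peer removes the matching entry."""
        self.entries.discard((src, port))


def attacker_observation(victim_sends_rst, capacity=8):
    """Number of SYN/ACKs the attacker receives for its own SYNs
    after flooding the zombie with SYNs forged from the victim."""
    zombie = SynCache(capacity)
    # Step 1: forged SYNs that appear to come from the victim.
    for port in range(capacity):
        admitted = zombie.on_syn("victim", port)
        # The zombie's SYN/ACK reaches the victim, which may reply with a RST.
        if admitted and victim_sends_rst:
            zombie.on_rst("victim", port)
    # Step 2: the attacker probes with SYNs from its own address.
    return sum(zombie.on_syn("attacker", port) for port in range(capacity))
```

In this toy model the attacker sees all of its SYNs answered when the victim resets the forged connections and none answered when the victim is silent, so the zombie's shared cache leaks the victim's behavior.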


Like virtually all side-channel attacks, idle scans are 
associated with shared, limited resources. Because these 
resources generally cannot be made unlimited, we rec- 
ommend in light of our results that trust relationships 
between hosts be made explicit to those hosts all the 
way down to the IP layer. Currently the only distinc- 
tion at the TCP and IP layers is subnetworks, which do 
not necessarily correspond to the IP-based trust relation- 
ships between hosts that are enforced by firewall rules 
and routing tables. Trusted hosts can be hosts protected 
by the same firewall or that have special trust relationships
in the packets they can route to each other. By
making a distinction between trusted and untrusted hosts,
non-interference can be achieved by statically dividing
shared resources, effectively eliminating idle scans. We
verify non-interference for our model with separate RST
rate limitations using symbolic model checking. Then,
using bounded model checking to a depth of 1000 transitions,
we demonstrate that our split SYN cache structure
has no violations of non-interference. This means that
no practical attack for this counterexample exists within
the constraints of our model.
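
The effect of statically dividing a shared resource can be illustrated with a toy split SYN cache. This is a minimal sketch under stated assumptions, not the SAL model: the zombie classifies packets by source address, the victim is in the zombie's trusted set, each class has its own fixed capacity, and all names are hypothetical. It compares the attacker's view across two scenarios in a single trace, which is far weaker than model checking.

```python
class SplitSynCache:
    """Toy SYN cache with separate pools for trusted and untrusted peers."""

    def __init__(self, capacity_per_class, trusted):
        self.capacity = capacity_per_class
        self.trusted = set(trusted)
        self.entries = {"trusted": set(), "untrusted": set()}

    def _pool(self, src):
        key = "trusted" if src in self.trusted else "untrusted"
        return self.entries[key]

    def on_syn(self, src, port):
        """Returns True if an entry was created (a SYN/ACK is sent)."""
        pool = self._pool(src)
        if len(pool) >= self.capacity:
            return False
        pool.add((src, port))
        return True

    def on_rst(self, src, port):
        self._pool(src).discard((src, port))


def attacker_view(victim_sends_rst, capacity=4):
    """Sequence of SYN/ACK outcomes the attacker sees for its own SYNs."""
    zombie = SplitSynCache(capacity, trusted={"victim"})
    # Forged SYNs carry the victim's address, so they land in the trusted pool.
    for port in range(capacity):
        if zombie.on_syn("victim", port) and victim_sends_rst:
            zombie.on_rst("victim", port)
    # The attacker's own SYNs only ever touch the untrusted pool.
    return [zombie.on_syn("attacker", port) for port in range(capacity)]
```

Here the attacker's view is the same whether or not the victim sends RSTs, which is the two-scenario form of non-interference for this one trace.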


This paper is organized as follows. Section 2 gives 
more background and related works. Our model is 
described in Section 3, followed by a description of 
the counterexamples discovered during the process of 
building the model and some experimental results from 
their implementation in Section 4. We demonstrate that 
non-interference is achievable by distinguishing between 
trusted and untrusted hosts in Section 4.3. This is fol- 
lowed by discussion and future work in Section 5. 
2 Background and related work


Because network reconnaissance is an important first
step in most network attacks, a fair amount of previous
work has focused on detecting port scans. Staniford et
al. [32] use simulated annealing to detect stealthy scans.
Leckie and Kotagiri [18] present a probabilistic approach
for detecting port scans, and Muelder et al. [23] propose
a visualization approach. The scan behavior of Internet
worms has been studied [29, 36, 12, 35], as have the
scan detection problem at the backbone level [31, 30] and
measurements of port scans and their side effects at Internet
telescopes [24]. Jung et al. [14] describe an approach
based on sequential hypothesis testing. Gates [9, 10]
and Kang et al. [15] consider the problem of stealth port
scans based on using many distributed hosts (e.g., a botnet)
to perform the scan. To our knowledge, ours is the
first study to model idle scans, which are a distinct stealth
technique that, in addition to being used for stealth, can
also be used for inferring IP-based trust relationships.
Passively identifying hosts that have no routable
IP address and are hidden by network address transla- 
tion [2, 17] is a related problem to idle scans, but as- 
sumes a very different threat model where some amount 
of traffic can be viewed passively by the attacker. 

Idle scans were introduced by Antirez [1] in a 1998 
posting to the bugtraq mailing list. The one currently 
known form of idle scan, based on non-random, sequen- 
tial IPIDs of older network stacks, was also described in 
this posting and is described in more detail by Lyon [20]. 
An IPID is a unique identifier for each IP packet, used 
primarily for IP fragmentation. In early implementations 
of the IP protocol, the IPID was chosen sequentially by 
simply incrementing the IPID value for each packet. An- 
tirez showed that this made it possible to perform an idle 
scan of a victim by using a third host, the zombie, which 
the attacker need not have control of, in a form of side- 
channel attack. 


In this form of idle scan, the attacker queries the zom- 
bie for IP packet responses and observes the sequence of 
IPIDs in the zombie’s responses. The attacker then sends 
one or more SYN packets to the victim on the target port 
to be scanned with the return IP address of the zombie 
and return port of a closed port on the zombie. If the vic- 
tim replies to the SYN with a SYN/ACK, meaning the 
victim’s port is open, then the zombie will reply to the 
victim with a TCP reset (RST) and the attacker will ob- 
serve a discontinuity in the sequence of IPIDs that it re- 
ceives from the zombie. If the victim port is closed, the 
SYN is dropped or replied to by the victim with a RST, 
which the zombie simply drops and no discontinuity is 
observed by the attacker. Thus, the attacker is able to 
infer the port status of the victim without revealing their 
return IP address to the victim. Furthermore, the attacker 
is able to infer trust relationships between the victim and
zombie. For example, the attacker might infer that the 
victim only accepts connections from a particular trusted 
subnetwork by using a zombie on that subnetwork. This 
is the one known form of idle scan in the current liter- 
ature. Modern network stacks randomize the IPID for 
security reasons not related to idle scans, so the zombie 
must be an older system for this known type of idle scan 
to work. 
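
The IPID-based inference can be illustrated with a toy simulation. This is a minimal sketch, assuming a zombie with a single global IPID counter that increments by one per packet sent, no other traffic to or from the zombie, and no packet loss; it ignores the 16-bit wrap-around of the real IPID field, and the class and function names are hypothetical.

```python
class Zombie:
    """Toy host with a sequential (non-random) global IPID counter."""

    def __init__(self, start=1000):
        self.ipid = start

    def _send(self):
        # Every packet the zombie sends consumes the next IPID.
        self.ipid += 1
        return self.ipid

    def probe(self):
        """Reply to the attacker's probe; the reply reveals the IPID."""
        return self._send()

    def on_unexpected_syn_ack(self):
        """A SYN/ACK for a connection the zombie never opened
        is answered with a RST, which consumes an IPID."""
        self._send()

    def on_rst(self):
        """An incoming RST is dropped; no packet is sent."""


def port_appears_open(zombie, victim_port_open):
    """Attacker's inference from two probes around a forged SYN."""
    before = zombie.probe()
    # The attacker sends a SYN to the victim, forged from the zombie.
    if victim_port_open:
        zombie.on_unexpected_syn_ack()  # victim replied with SYN/ACK
    else:
        zombie.on_rst()                 # victim replied with RST
    after = zombie.probe()
    # A gap larger than one means the zombie sent a packet in between.
    return after - before > 1
```

A practical scanner also has to account for interference from other hosts and for packet loss, which this sketch omits.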


IPID-based idle scans have been implemented in 
nmap [20] using an algorithm that accounts for interfer- 
ence from other hosts and packet loss. Lyon [20] is a 
good resource for how this type of idle scan works in 
detail and the different uses of idle scans. FTP bounce
scans [20] are currently the only known way to port scan 
a victim host or network without routing forged packets 
to that host or network from the attacker. These scans use 
a feature that has been largely discontinued in FTP server 
implementations because of the many potential abuses of 
it. FTP bounce scans require that the attacker be able to 
log into an FTP session on the zombie, and operate at 
layer 7 (the application layer) of the OSI network model, 
whereas the SYN cache idle scan that we describe oper- 
ates at layers 3 and 4 (TCP/IP) and requires only that the 
attacker be able to route SYN packets to an open port on 
the zombie. 

Non-interference [11] is a widely used concept of in- 
formation flow security that has seen wide application for 
proving security properties of programs. The works that 
are most related to ours in this space are those that treat 
non-interference as two or more separate scenarios that 
must produce the same result from the attacker’s view for 
non-interference to be demonstrated, e.g., TightLip [38] 
or the work of McCamant and Ernst [21, 22]. We apply 
non-interference to network stacks in this paper. Non- 
interference proved to be a very fruitful model of infor- 
mation flow in this context, but for future work that might 
consider packet loss, packet delay, and other such fac- 
tors, alternatives such as non-deducibility [33] may be 
necessary. For the modeling effort presented in this pa- 
per, which is based on an abstracted model of real net- 
works that does not include packet loss and delay, non- 
interference proved to be a very useful property because 
it can be specified with Linear Temporal Logic (LTL).
Treating the problem as a covert channel problem and 
studying object storage [16] and timing channels [37] is 
an attractive approach, but covert channel models assume 
collusion of the sender and receiver of information and 
do not capture in their models the sequences of events 
necessary to describe idle scans in a natural way. 

The model checker that we chose for our study is the 
Symbolic Analysis Laboratory (SAL) [3]. SAL pro- 
vides a SAT-based bounded model checker that allows 
for counterexamples to be easily interpreted as a trace 
through the states of the model, or, in our case specifically,
a sequence of packets. SAL also provides a
BDD-based symbolic model checker. Model checking 
has been applied to many properties of network proto- 
cols and their implementations where specific bugs lead 
to security vulnerabilities or availability issues, (e.g., 
[8, 25, 13]). We have particularly patterned our analysis 
following Rushby’s tutorials for modeling the Needham- 
Schroeder protocol [26] to identify Lowe’s bug and the 
fault-tolerant algorithm for maintaining interactive con- 
sistency (Byzantine agreement) [27] as the transition 
systems for these problems seem similar to the ones 
for modeling port scanning and side-channel attacks in 
a protocol stack. Our results demonstrate that model 
checking is also useful for studying information flow on 
networks, particularly in this paper within the context of 
idle port scans. 


3 Formalizing non-interference analysis of 
idle scans 


In this section, we first describe the basics of our net- 
work stack model, then we describe more details and its 
implementation in SAL, and finally we list simplifying 
assumptions of the model. 


3.1 Modeling the network stack 


A host is viewed to be at the end of the network, i.e., an 
end host. Hosts have internal state, such as a SYN cache,
RST rate limit variables, and receive buffers. Hosts also
have ports, which can be open, closed, or filtered, and whose
status does not change. An open port is one that the host
will accept an incoming TCP connection on. For UDP, 
open ports simply drop packets and closed ports send 
ICMP errors. Filtered ports behave as would a typical 
host, but for the results presented in this paper all ports 
are either open or closed and never filtered. Hosts reply 
to packets based on rules that model a typical Linux or 
FreeBSD network stack. Our model is based on the IP 
protocol and includes TCP (but only up to the point of 
half-open connections), ICMP, and UDP. 

The SYN cache on a host is a cache for pending SYN 
packets for which a SYN/ACK has been sent and the host 
is waiting for an ACK to complete the TCP three-way 
handshake. The SYN cache drops duplicate SYNs for 
the same IP address and port pairs. In our model packets 
are only removed from the SYN cache when a TCP RST 
is received from the source IP address and port of the 
original SYN packet (because we only model half-open 
TCP/IP connections, so there is no ACK for the third part 
of a three-way TCP handshake). When the SYN cache 
is full, the host replies with a SYN cookie and drops the 
SYN. A SYN cookie is a method for sending an initial 
sequence number in the SYN/ACK that, when ACKed
by the remote host, contains enough information to com- 
plete the connection so that no state about the half-open 
connection needs to be kept in memory [4]. TCP RST 
rate limiting, where the number of resets sent by a host is 
limited, is based on the FreeBSD implementation where 
separate rate limits are maintained for open ports and 
closed ports. 
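
The SYN cache behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's SAL code: the class name `SynCache`, its method names, and the reply labels are hypothetical, and an entry is keyed on source address, source port, and destination port.

```python
from dataclasses import dataclass, field

# Reply labels used by this sketch (illustrative names, not from the SAL model).
SYN_ACK, SYN_COOKIE, DROP = "SYN/ACK", "SYN cookie", "drop"


@dataclass
class SynCache:
    """Fixed-capacity cache of half-open connections (assumed structure)."""
    capacity: int = 1
    entries: set = field(default_factory=set)  # {(src_ip, src_port, dst_port)}

    def on_syn(self, src_ip, src_port, dst_port):
        key = (src_ip, src_port, dst_port)
        if key in self.entries:
            return DROP        # duplicate SYN for the same address and port pair
        if len(self.entries) >= self.capacity:
            return SYN_COOKIE  # cache full: reply statelessly, store nothing
        self.entries.add(key)
        return SYN_ACK         # entry is kept while waiting for the final ACK

    def on_rst(self, src_ip, src_port, dst_port):
        # A RST from the original source removes the pending entry, if any.
        self.entries.discard((src_ip, src_port, dst_port))
```

With `capacity=1`, a first SYN yields a SYN/ACK, a repeat of the same SYN is dropped, and a SYN from a different source yields a SYN cookie until a RST frees the entry.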





Figure 1: Basic definition of an idle scan. 


Figure 1 shows the basic definition of an idle scan that
we use for our model. There are four IP addresses, three 
for the attacker, zombie, and victim hosts, and one where 
there is no host so all packets are dropped. A solid arrow 
denotes that the source host can send packets to the des- 
tination using its own return IP address. A dashed arrow 
indicates that the source can send a packet to the destina- 
tion using any return IP address other than its own. The 
salient feature of this definition of an idle scan is that the 
attacker cannot send packets to the victim using its own 
return IP address. This entails that the victim never sends 
any packets to the attacker, and that the attacker there- 
fore only ever receives packets from the zombie, since 
the victim and zombie only ever reply to packets using 
their real IP address as the return address. 

Our goal is to ensure that the network satisfies the non- 
interference property, which is specified as: for any pos- 
sible sequence of packets that the attacker can send to the 
victim and zombie, the sequence of packets the attacker 
receives in response is identical regardless of whether the 
target victim port is open or closed. This models the de- 
sired behavior that the attacker cannot gain any informa- 
tion about the target victim’s port. 

We do this by modeling two possible scenarios faced 
by the attacker which the attacker is attempting to distin- 
guish and thus gain information about the victim. In each 
scenario, there is a victim and a zombie whose behavior 
and initial state are identical (except for the status of the 
target port on the victim, of course), but whose behavior 
and internal state over time can differ between scenarios 
through certain sequences of events due to the port status 
of the target port. In one scenario, the target port of the 
victim is open whereas in the second scenario, the tar- 
get port is closed. The attacker sends identical packets in 
both scenarios. 
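
The two-scenario comparison can be phrased as a small harness. The sketch below is illustrative only: `network` is a hypothetical factory that, given the status of the target port, returns a function from attacker packets to the replies the attacker observes, and the two toy networks are assumptions made up for the example.

```python
def attacker_view(network, probes, target_port_open):
    """Replies the attacker observes in one scenario."""
    respond = network(target_port_open)
    return [respond(p) for p in probes]


def non_interfering(network, probes):
    """True if both scenarios look identical to the attacker for these probes."""
    return (attacker_view(network, probes, True)
            == attacker_view(network, probes, False))


# Toy network whose replies never depend on the target port.
def opaque_network(target_port_open):
    return lambda probe: "RST"


# Toy network with a shared reply budget that the target port status influences.
def leaky_network(target_port_open):
    budget = [0 if target_port_open else 1]

    def respond(probe):
        if budget[0] > 0:
            budget[0] -= 1
            return "RST"
        return "drop"

    return respond
```

Here `non_interfering(opaque_network, ["p1", "p2"])` holds, while `non_interfering(leaky_network, ["p1"])` does not, because the same probe draws different replies in the two scenarios.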


Figure 2: Overview of our model (the IP address with 
no host that drops all packets is excluded from this 
figure for clarity). 


Figure 2 gives an overview of our model for testing 
non-interference properties of network stacks for idle 
scans. The status of the target port being open or closed 
is modeled as two different scenarios. Victim 1 and
Zombie 1, for example, exist in scenario 1, where the target
port on Victim 1 is open. Victim 2 and Zombie 2 exist
in scenario 2, where the target port on Victim 2 is closed.
The attacker can forge any arbitrary sequence of packets,
but it must forge identical packets in both scenarios.
The hosts in the different scenarios can respond differently
and contain different internal state. PacketA1
and PacketA2 are the sequences of packets the attacker
receives in scenario 1 and scenario 2, respectively.


In our model, the attacker can nondeterministically 
choose any arbitrary sequence of packets that do not vi- 
olate the definition of an idle scan. Furthermore, the at- 
tacker need not reply to packets; the fact that the model 
allows the attacker to send any arbitrary packet covers all 
possibilities for reply. For the destination and return IP 
addresses of a packet, the attacker can choose among its 
own IP address, that of the victim or the zombie, or an 
IP address with no live host (that simply drops all pack- 
ets). The only constraint is that the attacker cannot send 
a packet to the victim with its own IP address as the re- 
turn IP address, as this violates the definition of an idle 
scan. 
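
The address constraint on forged packets can be enumerated directly. This is a small sketch, assuming the four addresses are represented by the hypothetical labels below; it only covers the choice of destination and return address, not the other packet fields.

```python
from itertools import product

# Hypothetical labels for the four IP addresses in the model.
ADDRESSES = ("attacker", "victim", "zombie", "nohost")


def allowed(dst, ret):
    """The attacker may not send to the victim using its own return address."""
    return not (dst == "victim" and ret == "attacker")


# All (destination, return address) pairs the attacker may choose from.
pairs = [(dst, ret) for dst, ret in product(ADDRESSES, repeat=2)
         if allowed(dst, ret)]
```

Of the 16 possible pairs, only `("victim", "attacker")` is excluded, leaving 15 choices per forged packet.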


The attacker can distinguish between SYN cookies 
and regular SYN/ACKs that it receives in our model. 
This is true in reality due to the statistical properties of 
the initial sequence numbers of SYN cookies and the 
fact that they are never retransmitted whereas regular 
SYN/ACKs are. 


The attacker can choose any values for the IP protocol 
(TCP, UDP, or ICMP), TCP flags, source and destination 
ports, validity of checksums, and so on. Every packet 
that the attacker forges is forwarded to the appropriate
host in both scenarios. 
3.2 SAL for modeling, generating counterexamples,
and verifying properties


We model the network stack as a transition system. At an 
informal, high level, a transition system specifies compu- 
tation as a sequence of transitions in a state machine. A 
state is given by the values of the local variables used
to describe transitions. A transition system has an ini- 
tial state. For every transition, there is an optional guard, 
which when true in the current state, leads the computa- 
tion from the current state to the next state. For a nonde- 
terministic transition system, multiple transitions may be 
triggered and one transition is randomly selected. This 
is repeated and the computation terminates if no guard 
is true in the current state. Dijkstra’s guarded command 
language [7] is an example of a formalism for specifying 
transitions. 
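
A guarded-command transition system of this kind can be interpreted with a short loop. The sketch below is an illustration in Python rather than SAL; the `run` function, the dictionary-based state, and the toy `fullness` variable are assumptions chosen for the example.

```python
import random


def run(state, transitions, max_steps=100, rng=random):
    """Fire enabled transitions until no guard holds or max_steps is reached.

    `transitions` is a list of (guard, update) pairs; each update returns a
    new state dictionary rather than mutating the current one.
    """
    trace = [dict(state)]
    for _ in range(max_steps):
        enabled = [update for guard, update in transitions if guard(state)]
        if not enabled:
            break                          # no guard is true: terminate
        state = rng.choice(enabled)(state)  # pick one enabled transition
        trace.append(dict(state))
    return trace


# Toy system: a fill level that is decremented while it is nonzero.
transitions = [
    (lambda s: s["fullness"] != 0,
     lambda s: {**s, "fullness": s["fullness"] - 1}),
]
trace = run({"fullness": 3}, transitions)
```

Because only one transition exists here, the run is deterministic and stops once `fullness` reaches zero.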

We used SAL (Symbolic Analysis Laboratory) for 
specifying the transition system and analyzing its proper- 
ties. SAL is a language and a tool kit for specifying tran- 
sition systems and analyzing them using model check- 
ing. SAL provides support for a suite of tools which have 
been successfully used for analyzing protocols and dis- 
tributed algorithms (see [28]). 

Figure 3 shows the outline of our SAL code for the 
model. Ellipses indicate where detailed code has been 
omitted; the full model is 895 lines of SAL code.

For a transition step, a nondeterministic choice is
made between the attacker, victim, or zombie. If the 
attacker is chosen, it forges a nondeterministic packet, 
which can be a “drop” packet that has no effect. This 
packet is placed in the receive queue of the destination 
IP address. If the victim or zombie is chosen, it removes 
the next packet from its FIFO receive queue and replies 
based on its internal state and configuration. The func- 
tions ProperReply and UpdateSyncache are re- 
sponsible for choosing the packet to reply with, if any, 
and any updates to the host’s internal state (specifically 
the RST counter and SYN cache). Note that these are 
pure functions and do not update any state themselves. 

Figures 4 and 5 show how the ProperReply and 
UpdateSyncache functions are used. A transition
has a guard, e.g., “(z1.fullness /= 0 AND
z2.fullness /= 0) -->”, which is a quantifier-free
formula specifying a condition on the current state
that must hold before the transition is executed, followed by
a formula relating the current state to the next state.
An example of such a formula is “z1'.fullness =
z1.fullness - 1”, where z1' is the variable in the
next state and z1 is the variable in the current state. In
this example the variable is decremented by 1 in the
next state.

When the guard on a transition for the zombie or vic- 
tim fires, that host must remove a packet from its queue 
and then reply and update its state in both scenarios.
UpdateSyncache returns not only the new state of the 
SYN cache, but also a variable called .synPIsThere 
which can take on the values put, notexist, or 
exist. This return value is passed to ProperReply, 
which needs to know if a SYN packet was put in an entry 
in the SYN cache, no entry was found for it because the 
SYN cache is full, or it already existed in the SYN cache. 
In this way ProperReply knows whether to send a 
SYN/ACK, send a SYN cookie, or drop the packet, respectively,
if the packet is a SYN.

For example, if the zombie in one of the scenarios re- 
ceives a SYN packet, it calls UpdateSyncache to de- 
termine the new state of the SYN cache and what will 
happen to the packet. If the internal state of the zom- 
bie indicates that there is a free entry in the SYN cache, 
the fact that the SYN will be placed in the SYN cache 
and the new status of the SYN cache are returned by this 
function. Then ProperReply is called with this in- 
formation as an argument, and this function will deter- 
mine that the proper reply is a normal SYN/ACK, with 
the destination IP address as the source of the SYN, valid 
checksums, etc.

Another example is that a host (a victim or zombie) 
receives a SYN/ACK. UpdateSyncache returns the 
current state (i.e., no changes will be made to the SYN 
cache state) and then ProperReply will be called 
and will ignore .synPIsThere because the packet 
is not a SYN. The return value of ProperReply depends
on the RST counter. If the RST counter is nonzero,
the return value of ProperReply will be a RST
packet that the host will use for a reply and a decremented
RST counter. If the RST counter is already zero,
ProperReply will return a drop packet and leave the
RST counter at zero. All possible TCP, UDP, and ICMP
packets and their corresponding replies are enumerated
in ProperReply based on the reply that a typical network
stack would send.
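
The division of labor between the two pure functions can be sketched as follows. This is a loose Python rendering under assumptions, not the paper's SAL code: the signatures, the dictionary packet format, and the frozenset cache are hypothetical, only SYN and SYN/ACK packets are handled, and removal of entries by RSTs is omitted.

```python
PUT, NOTEXIST, EXIST = "put", "notexist", "exist"


def update_syncache(cache, capacity, pkt):
    """Return (new_cache, syn_p_is_there) without mutating the input cache."""
    if pkt["kind"] != "SYN":
        # Non-SYN packets leave the cache unchanged; the flag is then ignored.
        return cache, NOTEXIST
    key = (pkt["src_ip"], pkt["src_port"], pkt["dst_port"])
    if key in cache:
        return cache, EXIST        # duplicate SYN
    if len(cache) >= capacity:
        return cache, NOTEXIST     # cache full, no entry created
    return cache | {key}, PUT      # SYN stored in a free entry


def proper_reply(pkt, syn_p_is_there, rst_counter):
    """Return (reply_kind, new_rst_counter) for a received packet."""
    if pkt["kind"] == "SYN":
        reply = {PUT: "SYN/ACK", NOTEXIST: "SYN cookie", EXIST: "drop"}
        return reply[syn_p_is_there], rst_counter
    if pkt["kind"] == "SYN/ACK":
        if rst_counter > 0:
            return "RST", rst_counter - 1   # reply and use up one RST
        return "drop", 0                    # rate limit already reached
    return "drop", rst_counter              # other packets: not modeled here
```

Keeping both functions free of side effects means the caller applies the returned cache and counter to each scenario's host state explicitly, which matches the description of them as pure functions.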

By forcing the zombies or the victims in both scenar- 
ios to reply at the same time step, the model stays in se- 
quence. If a host in one scenario (e.g., the victim in the 
closed port scenario) replies to a packet whereas the cor- 
responding host in the other scenario (e.g., the victim in 
the open port scenario) drops the packet, a “drop” entry 
is inserted in the destination host’s queue as a filler and 
eventually ignored. This ensures that the packets that are 
received by the attacker vary only when non-interference 
is violated, i.e., only when the sequence diverges. 

Figure 3: Outline of SAL code for the network protocol stack model.


Figure 5: Structure of the UpdateSyncache and ProperReply functions.


SAL supports a suite of tools; the ones most relevant
for the analysis discussed in this paper include a
deadlock checker, a symbolic model checker for finite
state systems based on the CUDD BDD package, and a
bounded model checker based on the Yices SAT solver.
Properties of a transition system are specified in Linear
Temporal Logic (LTL). Our analysis involved using
properties of the form G(a), where a is a quantifier-free,
modality-free formula expressed using state variables, to
mean that a holds in every state of the transition system.
The non-interference property is specified as:


⊢ G(PacketA1 = PacketA2)


This means that the sequence of packets the attacker 
receives in response to its probes from the zombie in the 
first scenario is always identical to the response from the
zombie in the second scenario. 
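
Checking a property of the form G(a) up to a bounded depth can be illustrated with explicit breadth-first enumeration. Note that this is a swapped-in technique for illustration: SAL's checkers are SAT-based and BDD-based, whereas the sketch below simply enumerates states. The function name, the toy successor relation, and the pair-valued state are assumptions.

```python
from collections import deque


def find_violation(init, successors, invariant, max_depth):
    """Search breadth-first for a reachable state that violates the invariant.

    Returns the shortest trace ending in a violating state, or None if no
    violation is reachable within max_depth transitions.
    """
    queue = deque([(init, (init,))])
    seen = {init}
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return list(trace)
        if len(trace) - 1 >= max_depth:
            continue                      # depth bound reached on this path
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + (nxt,)))
    return None


# Toy model: state is (a1, a2), standing in for what the attacker has seen
# in scenario 1 and scenario 2. The two can diverge only after two steps.
def toy_successors(state):
    a1, a2 = state
    if a1 < 3:
        yield (a1 + 1, a2 + 1)
    if a1 == 2:
        yield (3, 2)


def same_view(state):
    return state[0] == state[1]
```

With a depth bound of 3 the search returns the trace leading to `(3, 2)`; with a bound of 2 it returns `None`, which shows why a bounded check only rules out counterexamples up to the chosen depth.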

We have used SAL’s bounded model checker for 
finding counterexamples as it is depth-first and explic- 
itly enumerates states. SAL’s symbolic model checker, 
which is exhaustive, is useful for finding smaller coun- 
terexamples as well as for proving properties of interest, 
which are often difficult to do by explicit state enumera- 
tion model checkers. A useful comparative study of ex- 
haustive symbolic model checkers and explicit state enu- 
meration model checkers is in [5] for protocol analysis 
and controllers. 


3.3 Assumptions to reduce the number of
states in the model


A number of assumptions were made to keep our model 
simple. Our strategy was to start with a simple model 
and introduce additional complexity into the model if 
no counterexamples are generated, and ensure that the 
abstractions we made caused no loss of generality that 
would exclude potential counterexamples. 

A major abstraction in the model is that we consider 
the proper reply to SYN/ACK packets to be “drop” for 
open ports and RST for closed ports. In reality, network 
stacks that respond differently to SYN/ACKs on open 
vs. closed/filtered ports typically respond with RSTs or 
ICMP and have different rate limits per port. Since the 
lower rate limit (typically ICMP) will cause drops before 
the higher rate limit, without loss of generality, we can 
consider open ports to simply drop SYN/ACK packets 
from the initial state. This is equivalent to assuming that 
the attacker immediately exhausts the lower rate limit. 

We also exclude ICMP and UDP from the split SYN 
cache version of our model. Since ICMP host error pack- 
ets have the same effect on the SYN cache as RSTs, 
and other ICMP and UDP packets make no relevant 
changes to the destination host’s TCP state, ICMP and 
UDP do not affect the non-interference property for the 
SYN cache structure. Invalid checksums in packet head- 
ers are also excluded, because they are dropped without 
affecting the state of the destination host in all cases. 

Another major abstraction is that each of the two 
buffers in our split SYN cache has only a single entry. 
There are three reasons why only a single entry in the
SYN cache is necessary in the model: 


• Pending entries in the SYN cache with source IP addresses
and ports (possibly forged by the attacker)
that correspond to invariant ports (that have the
same status in both scenarios) cannot cause divergence
in the internal state of any of the hosts in the
two scenarios. Thus, no more than one such SYN
cache entry at a time can be useful for creating a
counterexample.


• Even though RST rate limiting is performed separately
for open and closed ports, the rate limit value
stored by any host cannot be caused to diverge on
invariant ports. Only the target port on the victim
can cause divergence. Since only one such port exists,
only one SYN cache entry at a time can be useful
for creating a counterexample. If we had not
received a counterexample under this assumption,
we would have incrementally allowed more entries
in the SYN cache.


• While the single entry is full, the SYN cookies generated
in response to dropped SYN packets can only
cause internal state differences if sent to the target
port on the victim. If this is the case then the entry
in the SYN cache cannot also be a SYN packet
with the victim IP address and target port, since duplicate
SYNs are ignored by the SYN cache. Thus,
one SYN cache entry per trust level (trusted and untrusted)
is general enough to handle all of the cases
of any number of SYN cache entries.


Because of the above simplification of making the 
SYN cache have a single entry for each trust level, we 
modeled only three ports without loss of generality. Port 
1 is prohibited (e.g. by a firewall rule) to the attacker for 
the split SYN cache implementation, meaning that the 
attacker cannot send packets to port 1. This was done 
so that we could include the RST rate limitation, which 
has important interactions with the SYN cache, without 
receiving the RST rate limit counterexample. Port 1 is 
closed in both scenarios for the zombie; however, for the 
victim, it is open in one scenario and closed in the other. 
In other words, port 1 is the target port for the attacker
to get information. Port 2 is closed and port 3 is open 
on both hosts in both scenarios. Because closed ports 
are equivalent in terms of their responses, a single closed 
port per host is equivalent to any number of closed ports. 
Because the SYN cache has a single entry and open ports 
only have different behaviors based on the status of the 
SYN cache, a single open port per host is also equivalent 
to any number of open ports. 

In real SYN cache implementations, there is a timeout
after which SYNs that have not become fully open TCP
connections are dropped. Because our model allows the
attacker to remove any entry from the SYN cache at any
time via a RST packet (which is also possible in reality
for Linux SYN cache implementations), our model need
not incorporate this timeout. Also, RST rate limiting is
done per time period in reality. A fixed limit of RSTs
for an unbounded amount of time is a generalization of
this that does not exclude any counterexamples, because
for any violation of non-interference based on a rate limit
a single time period is enough to create a counterexample.


USENIX Association


Port   Zombie status   Victim status
1      Open            Open in scenario 1, closed in scenario 2
2      Closed          Closed
3      Open            Open


Table 1: Ports and their status in our model.


4 Finding and ameliorating idle scans 


In this section, we describe the counterexamples that 
our modeling effort produced and give experimental re- 
sults of an implementation of these counterexamples to 
demonstrate that they can indeed be used to do idle port 
scans. 


4.1 Discovering counterexamples 


We now describe the two counterexamples that were dis- 
covered during the process of developing the model. 


[Figure 6 panels: Open Scenario, Closed Scenario; hosts: Zombie, Victim, Attacker; annotated with the RST count.]


Figure 6: RST rate limiting counterexample.


4.1.1 RST rate limiting counterexample 


When we applied SAL’s bounded model checker to
a simpler version of the model, in which the SYN
cache did not play any role, for the property
“⊢ G(PacketA1 = PacketA2)”, SAL identified a counterexample
with RST_counter set to 3 in the initial state.
We simplified the model further by reducing the initial
value of RST_counter to 1 and still received a counterexample.
The counterexample in this case was found
much more quickly, at depth 5 in the transition system.


[Figure 7 panels: Open Scenario, Closed Scenario; hosts: Zombie, Victim, Attacker; annotated with the SYN cache fill level over time.]


Figure 7: SYN cache counterexample.

This counterexample is illustrated in Figure 6. The
figure shows the sequence of packets for the open vs.
closed scenarios that the attacker can send to distinguish
between the scenarios. Note that the attacker sends the
same sequence of packets in both cases. Dashed lines
are forged packets (for Figure 6 the packets are forged so
that they appear to come from the zombie). The numbers
at the bases and heads of the arrows represent the source
and destination port numbers, respectively. The RST count
shown for a period of time is the value of that variable in
each scenario.

In this example the attacker wants to discern the port 
status (open or closed) of port 1 on the victim. Port 2
on the zombie is closed. The packets the attacker sends 
are identical in both scenarios. The attacker cannot see 
the packets that are sent between the victim and zombie 
or zombie and victim. The port status must be inferred 
by the difference in the expected packet sequence that 
the attacker will see between the two scenarios. First, 
the attacker forges a SYN to the victim on the target port 
that appears to be from the zombie with return port 2. 
If the target port on the victim is open, it will respond 
to the zombie with a SYN/ACK on the zombie’s closed 
port 2, causing the zombie to send the victim a RST and 
decrement its RST count. If the port is closed, the vic- 
tim will respond to the zombie with a RST which the 
zombie ignores. Next, the attacker sends a SYN/ACK 
packet, using its own return IP address, to the closed port 
on the zombie. If the attacker receives a RST in response, 
then it can infer that the victim target port status is closed 
since an open port would have caused the zombie to have 
already reached its RST rate limit. 
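
The packet sequence of this counterexample can be replayed in a few lines. The sketch below is a simplified simulation under the model's assumptions (a zombie RST budget of 1, no packet loss or delay); the function names are hypothetical.

```python
def rst_scan(victim_port_open, rst_limit=1):
    """Return what the attacker observes from the zombie: 'RST' or 'drop'."""
    zombie_rst_count = rst_limit

    # Step 1: a forged SYN, apparently from the zombie's closed port 2,
    # reaches the victim's target port.
    victim_reply = "SYN/ACK" if victim_port_open else "RST"

    # Step 2: the victim's reply arrives at the zombie's closed port 2.
    # A SYN/ACK makes the zombie send a RST and use up its budget;
    # a RST from the victim is ignored.
    if victim_reply == "SYN/ACK" and zombie_rst_count > 0:
        zombie_rst_count -= 1

    # Step 3: the attacker sends a SYN/ACK to the zombie's closed port
    # using its own return address and waits for a RST.
    return "RST" if zombie_rst_count > 0 else "drop"


def infer_port_status(observation):
    """A RST means the zombie's budget was untouched, so the port is closed."""
    return "closed" if observation == "RST" else "open"
```

The attacker's packets are identical in both runs; only the zombie's remaining RST budget, which the attacker probes in the last step, differs.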
4.1.2 SYN cache counterexample


For the second case, we tried a more complex model that 
included a SYN cache. We started with a SYN cache 
of size 2, then simplified it further to size 1, and SAL’s 
bounded model checker still identified the counterexam- 
ple to the non-interference property as illustrated in Fig- 
ure 7. 


The relevant state in this case is the number of pending
SYN/ACK entries in the SYN cache, with a maximum
value of 1 in our model. The notable thing about
this form of idle scan is that the attacker never sends any 
packets to the victim, not even packets with forged re- 
turn IP addresses. Instead, the attacker sends a SYN to 
the zombie on an open port, with the return IP address of 
the victim and the return port as the target port. The zom- 
bie places this SYN packet in the SYN cache, which in 
our model has only a single entry, and sends a SYN/ACK 
response to the victim. If the victim target port is closed 
it will send a RST in response, which causes the zombie 
to remove the relevant SYN cache entry so that there is 
now a free entry in the SYN cache. An open target port 
on the victim will simply drop the SYN/ACK from the 
zombie, so that the SYN cache of the zombie remains 
full since the zombie is still waiting for a response to the 
SYN/ACK. The attacker can then infer the status of the 
zombie’s SYN cache, and therefore the victim port sta- 
tus, by sending a SYN to the zombie with the attacker’s 
own return IP address. A regular SYN/ACK means the
SYN cache entry was free; a SYN cookie indicates that
it was full.
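
This second counterexample can be replayed the same way. The sketch is a simplified simulation under the model's abstractions (single-entry SYN cache, open ports drop SYN/ACKs, closed ports answer with a RST); the function name and the tuple used as a cache entry are hypothetical.

```python
def syn_cache_scan(victim_port_open, cache_size=1):
    """Return the zombie's reply to the attacker's final probe SYN."""
    zombie_cache = set()

    # Step 1: the attacker sends the zombie a SYN on an open port, forged so
    # that the return address is the victim and the return port is the target.
    forged = ("victim", 1)
    if len(zombie_cache) < cache_size:
        zombie_cache.add(forged)   # zombie stores it and sends a SYN/ACK

    # Step 2: the victim receives the zombie's SYN/ACK on the target port.
    # Closed port: it answers with a RST, which frees the zombie's entry.
    # Open port (model abstraction): it drops the SYN/ACK, so the entry stays.
    if not victim_port_open:
        zombie_cache.discard(forged)

    # Step 3: the attacker sends a SYN with its own return address.
    if len(zombie_cache) < cache_size:
        return "SYN/ACK"           # free entry: target port was closed
    return "SYN cookie"            # cache still full: target port was open
```

Note that no packet from the attacker is ever addressed to the victim in this sequence.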


Note that responses to SYN/ACKs on open, closed, 
or filtered ports vary for different operating systems, but 
all that matters is that for open vs. closed or open vs. 
filtered the response differs in some way under certain 
conditions. More discussion of the possibilities for this 
is in Section 4.2. The SYN cache counterexample makes 
it possible to, e.g., port scan a network on a port that 
is blocked for the entire network from outside the fire- 
wall. Imagine in Figure 7 that the zombie and victim 
are behind a firewall and the attacker is outside the fire- 
wall. Even if the firewall drops all incoming packets with 
destination port, e.g., 22 for Secure Shell (SSH), the at- 
tacker can scan port 22 on the network by using other 
open ports. Also, there may be firewall rules that en- 
force that only trusted machines (e.g., the zombie) can 
route packets to the victim. In this case the victim might 
be an internal database server and the zombie is the web 
server interface to the database, for example. Informa- 
tion about what ports the victim has open might give the 
attacker an idea of whether compromising the zombie to 
subsequently get access to the victim is worth the effort 
and risk. It might also be that the attacker can route pack- 
ets to the victim, but not on the target port. For example, 
many machines leave certain ports open only for their
backup servers that contact them nightly. Or, the system 
administrator might only allow incoming SSH sessions 
on their critical servers from their own office machines 
and not from other IP addresses. Knowing these kinds of 
trust relationships and exploiting them to find out more 
about the victim machines can be very valuable to an at- 
tacker. 

For each host, both the SYN cache and the reset rate 
limiting variables constitute shared, limited resources, 
which are the sources of violations of non-interference 
in our two counterexamples. 


4.2 Experimental confirmation of coun- 
terexamples 


We implemented both counterexamples to verify that 
these two new forms of idle scan that resulted from the 
modeling effort were possible for real hosts. Our results 
presented in this section demonstrate that the differences 
in the sequence of packets the attacker sees translate from 
the abstract notion of non-interference in our model to 
differences that can be seen in real network packet traces. 
Our implementations of the two idle scans are not opti- 
mized for speed or stealth, nor do they account for packet 
loss or packet delay, but in this section we discuss the 
practicality of these two forms of idle scan and conclude 
that they are both practical. 


4.2.1 Experimental setup 


For our experiments, we set up VirtualBox [34] virtual 
machines connected using IPv4 on two different subnet- 
works with TUN/TAP interfaces. The attacker machine 
was the host, and one subnet contained a Linux kernel 2.4 
host (Fedora Core 1) which served as the zombie for the 
SYN cache idle scan implementation. The other subnet 
contained a Windows XP host with no service packs, a 
Linux kernel 2.6 host (CentOS 5.2), and a FreeBSD 7.1.1 
host. The latter served as the zombie for RST rate lim- 
iting idle scan implementation. IP forwarding between 
these two subnetworks was performed by the host. Pack- 
ets were generated and captured by separate threads us- 
ing the Perl Net::RawIP and Net::Pcap libraries, respec- 
tively. 


4.2.2 RST rate limiting idle scan implementation


In our transition system model, RSTs are limited to a
finite number for infinite time. For a real FreeBSD system,
RSTs are limited to a default of 200 per second,
with separate limitations for open and closed ports. Our
implementation sends 2000 each of two different types
of packets, each at a rate of 180 per second, to the victim
and FreeBSD zombie, respectively. One type of packet
is forged SYNs to the victim on the target port that appear
to be from the zombie on a port that is closed on the
zombie. The other is SYN/ACKs from the attacker to the
zombie, which the zombie should reply to with RSTs. If
the zombie is sending RSTs at a rate of 180 per second to
the victim in response to the victim’s SYN/ACKs (meaning
the victim target port is open), this should interfere
with the rate at which the zombie sends RSTs to the attacker.
Thus the number of RSTs the attacker receives in
our experiment can be used to infer the port status of the
target port on the victim. We repeated the RST rate limiting
idle scan experiment 700 times each for an open and
closed port on the victim. The victim was a Linux kernel
2.6 virtual machine. The host-based firewalls on both
machines were disabled, although for the victim the idle
scan works whether the host-based firewall is enabled or
not. For FreeBSD, RST rate limiting does not apply to
filtered ports. The pf host-based firewall is disabled by
default for FreeBSD installations.


Table 2: Results from RST rate limiting idle scan implementation.

The results from our RST rate limiting idle scan are 
shown in Table 2, where the results are based on the num- 
ber of RSTs the attacker receives. When the victim port 
is closed, the attacker receives all 2000 RST responses 
from the zombie. When the victim port is open, the at- 
tacker receives at most 1634 RSTs. Thus, determining 
if the target port is open or closed is straightforward for 
idle scans based on RST rate limiting. 
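
The inference step can be written as a simple threshold test on the number of RSTs received. The figures 2000, 1634, 180, and 200 come from the text; the midpoint threshold and the function name are assumptions made for illustration and are not part of the reported experiment.

```python
SENT = 2000                 # SYN/ACK probes sent to the zombie per run
MAX_SEEN_WHEN_OPEN = 1634   # largest RST count reported for an open port
RST_LIMIT_PER_SEC = 200     # FreeBSD default RST rate limit
PROBE_RATE_PER_SEC = 180    # rate of each of the two packet streams

# With an open target port the zombie is asked for RSTs by both the victim
# and the attacker, which together exceed the rate limit.
DEMAND_WHEN_OPEN = 2 * PROBE_RATE_PER_SEC
assert DEMAND_WHEN_OPEN > RST_LIMIT_PER_SEC

# Hypothetical decision threshold: midpoint between the two reported cases.
THRESHOLD = (SENT + MAX_SEEN_WHEN_OPEN) // 2


def classify(rsts_received, threshold=THRESHOLD):
    """Guess the target port status from the attacker's RST count."""
    return "closed" if rsts_received > threshold else "open"
```

A real scan would need a threshold tuned to packet loss and timing on the network in question; the midpoint is used here only because the two reported cases are well separated.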


4.2.3 SYN cache idle scan implementation


SYN cache implementations vary for different operating 
systems. While the SYN cache idle scan is possible us- 
ing virtually any host as a zombie, the simplest network 
stack to use as a zombie is Linux kernel 2.4. Linux ker- 
nel 2.4 uses a simple buffer for the SYN cache, with be- 
tween 128 and 1024 entries depending on the memory 
available on the system. Our Linux kernel 2.4 virtual machine
zombie had a SYN cache size of 128, but Linux enforces
a rule that only three fourths of the SYN cache can
contain SYN packets from hosts that have not demon- 
strated their liveness in the recent past by completing a 
fully open TCP connection. This effectively reduces the 
SYN cache size to 97. We did not enable SYN cookies, 
which are disabled by default in Linux. The attack works 
basically the same whether or not SYN cookies are en- 
abled. We ran two separate sets of experiments for the 
SYN cache idle scan implementation, one to demonstrate
that it is possible for the attacker to detect the presence of 
live machines and perform a rudimentary form of operat- 
ing system detection, and another to demonstrate that un- 
der certain circumstances the attacker can infer the port 
status of a target port on a particular victim IP address.
For all experiments, 100 data points were generated for 
both open and closed port scenarios. 

For checking for liveness, we scanned four different IP 
addresses. One is a default FreeBSD 7.1.1 machine (with 
the pf host-based firewall disabled, as is the default), an- 
other is a Windows XP machine with no service packs 
(with Windows firewall disabled, as is the default), and 
a third is a Linux kernel 2.6 machine (CentOS 5.2, with 
iptables enabled, as is the default). The fourth IP address 
has no live host so all packets are simply dropped. Forg- 
ing packets from random return IP addresses on these 
victims is very likely to send SYN/ACKs to closed or 
filtered ports, so we choose random return ports for all 
forged SYNs where the attacker uses the victim as the 
return IP address. Varying this return port number is im- 
portant because if the return port is not different then the 
forged SYNs will have the same IP addresses and ports 
for both the destination and source and the SYN cache 
will drop such duplicates. Both RSTs and ICMP errors 
cause their corresponding entries to be removed from the 
SYN cache when received by the zombie. 
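
The cache behavior relied on here (a bounded buffer keyed by address/port pairs, suppression of duplicates, and removal of an entry when a RST or ICMP error arrives) can be illustrated with a minimal Python sketch. The class `ToySynCache`, its method names, and the example addresses are hypothetical and chosen only for illustration; this is a toy model under those assumptions, not the Linux implementation.

```python
class ToySynCache:
    """Toy model of a zombie's SYN cache as a bounded set of half-open connections."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = set()  # keys: (src_ip, src_port, dst_ip, dst_port)

    def on_syn(self, src_ip, src_port, dst_ip, dst_port):
        """Return True if the SYN is cached (a SYN/ACK would be sent to src)."""
        key = (src_ip, src_port, dst_ip, dst_port)
        if key in self.entries:
            return False  # duplicate address/port pairs are dropped
        if len(self.entries) >= self.capacity:
            return False  # cache full: no entry is created, no SYN/ACK is sent
        self.entries.add(key)
        return True

    def on_rst_or_icmp_error(self, src_ip, src_port, dst_ip, dst_port):
        """A RST or ICMP error from the SYN's apparent sender frees the entry."""
        self.entries.discard((src_ip, src_port, dst_ip, dst_port))


cache = ToySynCache(capacity=2)
# Forged SYNs that appear to come from the victim, with distinct return ports.
first = cache.on_syn("192.0.2.7", 40001, "192.0.2.1", 80)
duplicate = cache.on_syn("192.0.2.7", 40001, "192.0.2.1", 80)
second = cache.on_syn("192.0.2.7", 40002, "192.0.2.1", 80)
# The cache is now full, so the attacker's own SYN gets no SYN/ACK.
probe_when_full = cache.on_syn("198.51.100.9", 50000, "192.0.2.1", 80)
# A live victim answers the unexpected SYN/ACK with a RST, freeing an entry.
cache.on_rst_or_icmp_error("192.0.2.7", 40001, "192.0.2.1", 80)
probe_after_rst = cache.on_syn("198.51.100.9", 50000, "192.0.2.1", 80)
print(first, duplicate, second, probe_when_full, probe_after_rst)
```

In this toy model, whether the attacker's own SYN is answered depends on whether the victim freed an entry, which is the information flow the idle scan exploits.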

Because Linux responds to SYN/ACKs on filtered
ports at a very low rate (about 10 per second) with
ICMP host prohibited packets, FreeBSD responds to
SYN/ACKs on closed ports at a rate of at most 200 per
second, Windows responds on closed ports with RSTs at
an unlimited rate, and IP addresses without live hosts
simply cause SYN/ACKs to be dropped, it is possible
for the attacker to idle scan a subnetwork and infer something
about the operating systems installed on the live hosts
it discovers. To scan a single IP address, our
implementation sends 50 forged SYNs (that appear to be 
from the victim), then 50 each of forged SYNs and SYNs 
where the attacker uses their own return IP address, and 
then 200 more forged SYNs, all at a rate of 1000 per sec- 
ond. It then sends 200 each of forged SYNs and SYNs 
where the attacker uses their own return IP address at a 
rate of 400 per second. The number of SYNs where the 
attacker uses their own return IP address and receives a 
SYN/ACK response can then be used to infer the live- 
ness and operating system of the IP address. The results 
from this experiment are shown in Table 3, where the re- 
sults are based on the number of SYN/ACKs the attacker 
receives (note that for Linux kernel 2.4 network stacks 
SYN/ACKs are retransmitted five times until they time 
out after 190 seconds). 

Not live | 109.1 | 7.4 | 96 | 123
FreeBSD | 300 | 0 | 300 | 300

Table 3: Results from SYN cache idle scan implementation
for liveness and operating system.


Under certain circumstances, it is also possible to port
scan specific ports on a particular IP address using a SYN
cache-based idle scan. Specifically, if the responses or
rates differ for open vs. closed or filtered ports on the
victim then scanning a target port is possible. Examples
of this are FreeBSD with the pf host-based firewall disabled,
where open ports and closed ports are rate-limited
separately, or Linux hosts with the iptables host-based
firewall enabled and an open port that does not use the
stateful module of iptables.


To test the FreeBSD example, we developed a SYN 
cache-based idle scan that simultaneously sends 20000 
forged SYN packets (with random return ports that are 
closed on the zombie) as quickly as possible while send- 
ing, at half the rate, alternating forged SYNs with the tar- 
get port on the victim as the source port and valid SYNs 
with the return address of the attacker. Because closed 
ports on the victim are rate limited due to the forged 
SYNs with random return ports coming from the zom- 
bie, the forged SYNs with the target port on the victim as 
their return port will quickly fill the SYN cache if the tar- 
get port is also closed and cause fewer entries to be free 
for non-forged attacker SYNs, therefore causing the at- 
tacker to see fewer SYN/ACKs in response. If the target 
port is open, the open port sends more RSTs before rate 
limiting begins meaning that more SYN cache entries re- 
main free and the attacker sees more SYN/ACKs. The 
results of this experiment are shown in Table 4, where 
the results are based on the number of SYN/ACKs the 
attacker receives. Some data points for both closed and 
open ports were thrown out due to failures of the Python 
pcap library at high packet rates. Packet loss due to the
high rates could only make the distributions more similar,
not less, because more packets are sent over the
TUN/TAP interface for the open port scenario. Thus, the
distributions for open and closed ports are clearly different.
A two-sample, unpooled t-test (which assumes
neither known variances nor equal variances) for these
two sets of data gives a t-score of 7.71 with 197 degrees
of freedom, which corresponds to a p-score of
0.999999999999696, meaning that the null hypothesis that
the two distributions have an equal mean is rejected with
very high confidence.
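
As a rough check of the unpooled test, the t-score and degrees of freedom can be recomputed from summary statistics alone. The sketch below assumes means of 262.1 (open) and 218.0 (closed) with standard deviations of 41.8 and 39.3, as the Table 4 rows appear to read, and assumes 100 samples per scenario even though some data points were discarded; the function name `welch_t` is hypothetical.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled (Welch) t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first sample mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second sample mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Assumed inputs: summary statistics read from Table 4, n = 100 per scenario.
t_score, dof = welch_t(262.1, 41.8, 100, 218.0, 39.3, 100)
print(f"t = {t_score:.2f}, df = {dof:.1f}")
```

Under these assumptions the sketch yields a t-score close to 7.7 and roughly 197 degrees of freedom, in line with the reported values; small differences are expected from rounding in the table and from the discarded data points.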


Open | 262.1 | 41.8
Closed | 218.0 | 39.3 | 68

Table 4: Results from SYN cache idle scan implementation
for port scanning FreeBSD.


For port scanning Linux-based victims, the idle scan
first sends 96 filler SYNs to fill all but one entry in the
SYN cache. SYN/ACK replies to filler SYNs are not
counted in the results. Then it alternates, at an overall
rate of 100 packets per second, forged SYNs with the return
IP address of the victim and return port of the target
port, filler SYNs, and probe SYNs. Table 5 shows the
results of these experiments, where the results reflect the
number of SYN/ACK responses to probe SYNs.

Table 5: Results from SYN cache idle scan implementation
for port scanning Linux.


4.2.4 Stealth and efficiency 


Our idle scan implementations in this section are in- 
tended to show that the abstract counterexamples that re- 
sulted from our modeling effort were real divergences 
in real network stacks that could be exploited by the at- 
tacker for idle scans. Since the divergences are based on 
rates in real network stacks we used hypothesis testing 
to show this. We only report a t-score and p-score for
one set of experiments (the SYN cache idle scan implementation
for port scanning FreeBSD) because the distributions
of the results for other experiments were so
different that their high t-scores led to p-scores that were
within floating point rounding error of 1.0. Our implementations
of these idle scans were designed for this hypothesis
testing and therefore are not optimized for attacker
stealth or efficiency in carrying out the scan. For
assessing the practicality of these idle scan techniques, 
we will now comment on stealth and efficiency. 

For the RST rate limiting idle scan, the attacker can- 
not perform the idle scan without sending more than 200 
SYN/ACKs to the zombie either directly or indirectly. 
However, the attacker need not send SYNs to the victim 
(forged from the zombie) at half this rate. It is possible 
to, e.g., send SYNs to the victim at a rate of 20 per sec- 
ond and send SYN/ACKs (or any packet that will elicit a 
RST) to the zombie at 195 per second. Theoretically, the 
mutual information between the victim port status and 
the sequence of packets the attacker sees is non-zero even
if the attacker sends only a single forged SYN to the vic- 
tim, and even when packet loss is accounted for. Thus the 
attacker has a fair amount of flexibility in terms of trad- 
ing off speed of the scan vs. stealth for packets it sends to 
the victim. Furthermore, sending SYNs simultaneously 
to multiple victims and multiple ports and measuring the
zombie responses in the aggregate can increase the efficiency
of the scan if the distribution of expected closed
vs. open victim ports diverges from an equal distribution.
To see this, consider the extreme case where a large subnetwork
has only a single host with an open port; something
similar to a binary search could greatly reduce the
amount of time necessary for the scan in this case.
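
The binary-search idea can be sketched as adaptive group testing. In the Python sketch below, `any_open` is a hypothetical oracle standing in for one aggregate idle scan that reports whether any target in a group has the port open; how reliably a real aggregate measurement supports such a query is not established here, and the host addresses are illustrative.

```python
def find_open(targets, any_open):
    """Locate all open targets using aggregate queries (adaptive group testing)."""
    queries = 0
    found = []
    stack = [list(targets)]
    while stack:
        group = stack.pop()
        queries += 1
        if not any_open(group):
            continue  # one aggregate query rules out the whole group
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])  # halves are queried separately
        stack.append(group[:mid])
    return found, queries


# Hypothetical subnetwork of 256 hosts where exactly one has the port open.
hosts = [f"10.0.0.{i}" for i in range(256)]
open_hosts = {"10.0.0.77"}
found, queries = find_open(hosts, lambda group: any(h in open_hosts for h in group))
print(found, queries)
```

With a single open host among 256, this sketch uses 17 aggregate queries rather than 256 individual ones.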


For the SYN cache idle scans, which are more pow- 
erful in terms of the new capabilities they offer attackers 
beyond the currently known idle scan technique, there 
is a wider range of efficiency and stealth tradeoffs that 
the attacker can make. Furthermore, unlike ICMP IPID- 
or RST rate limit-based idle scans, virtually any modern 
network stack that offers any type of protection against 
SYN flooding can be used as a zombie. We chose to use 
a low-memory Linux kernel 2.4-based zombie for our 
experiments due to its simplicity and small SYN cache 
size, but larger SYN cache sizes or more complex SYN 
cache implementations are also easily exploited for SYN 
cache idle scans. The SYN cache only needs to be al- 
most full for SYN cache idle scans to work, and SYNs 
for half-open connections take 190 seconds to time out
in Linux by default. So even for high-memory Linux 2.4 
machines with 1024 SYN cache entries (of which 769 are 
used, compared to 97 for 128-entry SYN caches), the rate 
necessary to create the conditions for an idle scan only 
increases from 0.5 SYNs per second from the attacker 
to the zombie to about 4.1 SYNs per second (these rates 
keep the buffer almost full despite the timeouts). Once 
these conditions are created, the attacker effectively can 
do a SYN/ACK scan of the victim host or network at 
the cost of two packets sent per SYN/ACK query and 
three more generated as responses. It also does not mat- 
ter whether or not the zombie implements SYN cook- 
ies, since SYN cookies are never retransmitted (com- 
pared to typically three to five retransmissions of regular 
SYN/ACKs for various zombie configurations) and also 
have easily identifiable statistical anomalies in their ini- 
tial sequence numbers. 
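
The rates quoted above follow from simple arithmetic: to keep the cache almost full, entries must be replaced about as fast as they time out. In the Python sketch below, the rule `3 * size // 4 + 1` for the number of usable entries is an assumption inferred from the two figures in the text (97 of 128 and 769 of 1024), not taken from the kernel source, and the function names are hypothetical.

```python
SYN_TIMEOUT_SECONDS = 190  # default half-open timeout quoted in the text


def usable_entries(cache_size):
    """Entries available to unproven hosts (assumed rule: three fourths, plus one)."""
    return 3 * cache_size // 4 + 1


def refill_rate(cache_size, timeout=SYN_TIMEOUT_SECONDS):
    """SYNs per second needed to replace entries as fast as they time out."""
    return usable_entries(cache_size) / timeout


for size in (128, 1024):
    print(size, usable_entries(size), round(refill_rate(size), 2))
```

Under these assumptions the sketch gives roughly 0.5 SYNs per second for a 128-entry cache and about 4 SYNs per second for a 1024-entry cache, close to the figures in the text.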


Some SYN cache implementations that are not simple 
buffers like Linux 2.4 may make SYN-cache idle scans 
slightly more difficult, but still possible and relatively ef- 
ficient. For example, the FreeBSD SYN cache imple- 
mentation [19] uses a SYN cache with 512 buckets that 
each have 30 entries and are chosen uniformly at ran- 
dom using a hash of the IP address/port pairs and a host- 
generated secret. This mechanism is designed to stop 
denial-of-service, not idle scans. It creates some equiv- 
ocations that can reduce the amount of information flow 
the attacker can exploit for idle scans but the attacker 
can still perform the scan relatively efficiently even with 
FreeBSD zombies. We have not explored the SYN cache 
implementations of Linux kernel 2.6 or Windows hosts, 
but all modern network stacks must have some form of
SYN cache for reliability purposes and a limit on this resource
to prevent denial-of-service. Thus, non-interference
that prevents idle scans is possible only if this resource
is not shared, and the current OSI network
stack model with TCP/IP does not make the necessary
trust distinctions to split the SYN cache. Thus, virtually
every end host machine with at least one open port that
the attacker can route to is a potential zombie.


The rate at which the attacker must send packets to 
the zombie for a SYN cache idle scan, and therefore the
stealth of the scan, depends on the attacker’s goals. If 
the zombie is a Linux kernel 2.4 machine and the at- 
tacker wants only to check the liveness of a range of IP 
addresses on the victim network, then between 0.5 and 
4.1 packets per second plus the probes themselves is sufficient.
Note that, in terms of stealth, it is also relevant
that the attacker need not send any packets to the victim 
for this form of idle scan, not even packets with forged 
return addresses. 


For detecting the operating system of a victim host or 
scanning individual ports on the victim, higher rates are 
necessary. Detecting a Linux machine on the victim net- 
work and port scanning it can easily be done at between 
10 and 20 packets per second. We also discovered during 
our experiments that, at least for Linux kernel 2.4 hosts, 
it is easy for the attacker to not only remove their own 
packets from the SYN cache manually, but any packet 
that they have forged, using forged RSTs. This is because
only the IP address and port pairs are checked;
the sequence and acknowledgment numbers for RSTs are
ignored when deciding whether to drop an entry from 
the SYN cache on the zombie. Thus, the attacker has 
a high degree of control over the SYN cache status of 
the victim. Packet delay, packet loss, and interference 
from other machines that contact the zombie can easily 
be accounted for in this way, and the aggregate effect of 
scanning multiple victims at a time mentioned above also 
applies to SYN cache idle scans. In future work we in- 
tend to model the capabilities of this attack as a Markov 
Decision Process and discern tight bounds on numbers 
of packets and rates needed for different attacker goals. 


In terms of the practicality of our attacks, the ability 
to scan firewalled ports and discover machines on pro- 
tected networks that the attacker cannot route packets 
to certainly underscores the need for good ingress filter- 
ing and DMZ management. Our attacks are applicable 
in all three of the following scenarios: when the victim 
and zombie are on the same subnet and communicate using
ARP, when the victim and zombie are within the
same network domain but on different subnets, and when
the victim and zombie are geographically separated by
some distance on the Internet. The attacker can be inside
or outside the network domain of the zombie and
victim. Thus, many possibilities for network inference
arise. For example, the attacker can infer when a host
opens ports only to other particular machines, such as a 
backup server or network administrator. While many net- 
work configurations prevent IP address spoofing, which 
is essential for both attacks described in this paper, idle 
scans are a very general technique that can apply in a 
variety of scenarios. 


4.3 Ensuring non-interference using the SAL model checker


Based on our experimental results from implementing 
the two counterexamples as idle scan attacks, it is ap- 
parent that RST rate limiting and the SYN cache interact 
in complex ways and cannot be considered separately. 
Thus we chose to leave RST rate limiting in the model 
for verifying non-interference of the split SYN cache. 

It is well-known that verifying properties using a 
model checker is much more difficult than finding 
a counterexample. We abstracted the model down 
to the simplest form that produces both counterex- 
amples, and attempted to prove the non-interference 
property for cases where the shared, limited resources 
were split based on trust relationships and therefore no 
longer shared. The zombie and victim consider each 
other trusted and the attacker untrusted. For the RST 
rate limiting counterexample, the hosts have separate 
RST_count RST counters for trusted vs. untrusted 
hosts, and the SYN cache is removed. For the SYN cache 
counterexample, we implemented a split SYN cache 
structure with separate SYN cache buffers for trusted vs. 
untrusted hosts. 

In the first case, we removed the SYN cache and focused
only on the RST rate limitation counterexample.
Since symbolic model checkers are known to be better 
for verifying properties in contrast to explicit state enu- 
meration based bounded-model checkers, we used SAL’s 
symbolic model checker. It verified the property that: 


|- G(PacketA1 = PacketA2)


This verification completed in a little over 5 minutes. 

Encouraged by this result, we introduced the SYN 
cache back into the model. The symbolic model checker 
ran out of memory on a machine with 16GB of mem- 
ory after three days. We then ran the bounded model 
checker up to depth 1000 (to mean that all sequences of 
transitions of length < 1000 are checked for counterex- 
amples), and the model checker did not report any coun- 
terexample, which is very encouraging. This means that 
the attacker cannot violate non-interference with any idle
scan where fewer than 1000 transitions occur. The SYN
cache counterexample to our shared SYN cache implementation
required only 5 transitions. Informally, this
result means that there exists no attack, even with only
a single entry in the SYN cache, where the attacker can 
violate non-interference with 1000 or fewer packets be- 
ing generated by the attacker or by the zombie and vic- 
tim’s responses. Currently, we are exploring alternatives 
to symbolic model checking for the split SYN cache, including
verifying the property through k-induction [6].
We are also considering attempting a proof by induction 
on an induction-based theorem prover such as ACL2 or 
RRL. 
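
The idea of a bounded non-interference check can be illustrated on a much smaller model than the one given to SAL. The Python sketch below is a toy explicit-state search, not the SAL model: it assumes a zombie with a fixed RST budget (`LIMIT`, with time ignored), two hypothetical attacker actions (`probe` and `forge`), and it compares what the attacker observes when the victim port is open against what it observes when the port is closed, for every action sequence up to a chosen depth.

```python
from itertools import product

LIMIT = 2  # RSTs the zombie may send per budget (toy value, time is ignored)


def observe(actions, port_open, split):
    """Return the attacker-visible packets for one run of the toy model."""
    budget = {"shared": LIMIT, "trusted": LIMIT, "untrusted": LIMIT}
    attacker_bucket = "untrusted" if split else "shared"
    victim_bucket = "trusted" if split else "shared"
    seen = []
    for action in actions:
        if action == "probe":
            # Attacker sends a SYN/ACK to the zombie; a RST comes back only
            # if the applicable budget is not exhausted.
            if budget[attacker_bucket] > 0:
                budget[attacker_bucket] -= 1
                seen.append("RST")
            else:
                seen.append("none")
        else:
            # Forged SYN to the victim, apparently from the zombie. An open
            # port makes the victim send a SYN/ACK to the zombie, whose RST
            # reply consumes budget; a closed port yields a RST that the
            # zombie ignores. The attacker sees nothing directly.
            if port_open and budget[victim_bucket] > 0:
                budget[victim_bucket] -= 1
            seen.append("none")
    return tuple(seen)


def find_counterexample(split, depth):
    """Search all action sequences up to `depth` for differing observations."""
    for length in range(1, depth + 1):
        for actions in product(("probe", "forge"), repeat=length):
            if observe(actions, True, split) != observe(actions, False, split):
                return actions
    return None


shared_cex = find_counterexample(split=False, depth=6)
split_cex = find_counterexample(split=True, depth=6)
print("shared:", shared_cex)
print("split:", split_cex)
```

In this toy model the shared budget admits a three-step counterexample, while the split budget admits none up to the search depth. As with the bounded model checker, the absence of a counterexample within the bound is not a proof for longer sequences.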


5 Concluding remarks and future work 


We modeled idle scans for modern network stacks us- 
ing transition systems and analyzed them using model 
checking. This modeling effort led to the discovery of 
two new forms of idle scan, each of which was associated
with a shared, limited resource. Our results demonstrate
that non-interference for network protocol stacks
warrants further study. One of the two new forms
of idle scan gives the attacker capabilities
that no current port scanning technique below
layer 7 (the application layer) provides. We demonstrated
in this paper that it is possible for an attacker to port scan
a network from outside the firewall on a port that the firewall
blocks, for example. We also showed that this form
of idle scan, based on SYN caches, can be used for a 
rudimentary form of operating system detection. In light 
of these results, a more formal treatment of information 
flow in networks is needed so that we can better under- 
stand advanced idle scans, both for existing networks and 
in future protocol designs. 

We discussed the stealth and efficiency of the idle 
scans in Section 4.2.4. While it is clear both that the 
attacks are practical and that certain defenses exist in 
some situations, a more thorough treatment of possible
scans and defenses to detect or eliminate them is needed. 
By modeling idle scans as a Markov Decision Process, 
it will be possible to explore this space more thoroughly 
and find boundaries in terms of packet rates. 

Using SAL’s model checkers, we were able to iden- 
tify counterexamples to non-interference, in the form of 
idle scans, from our formal model of a network stack as 
a transition system. After fixing the model by splitting 
limited resources and separating them for trusted vs. untrusted
hosts, we were able to verify the non-interference
property for the RST rate limit case. However, we were 
only able to show the non-interference property with 
both RST rate limiting and a SYN cache up to 1000 tran- 
sitions. Verifying the non-interference property for this 
more general fix using a model checker remains a chal- 
lenge. We plan to investigate induction methods for this. 
While non-interference and model checking proved useful
for studying specific shared resources, we were not
able to build a model complex enough to discover unexpected
counterexamples.

Our model of network stacks was at the level of abstraction
of sequences of packets. A richer model that
includes memory usage, packet loss, and packet delay 
would likely produce more counterexamples to the non- 
interference property for idle scans. Thus we propose 
that trust relationships be made explicit all the way down 
to the IP layer in future protocol designs. Because all 
resources are inherently limited, giving protocol imple- 
mentations a mechanism that can help divide these re- 
sources and remove sharedness is the only way to ad- 
dress the advanced network reconnaissance attacks of the 
future. Our results in Section 4.3 demonstrated that non-interference,
which effectively eliminates idle scans, is
achievable by statically dividing resources based on trust
relationships.

eling effort but were not unexpected results from the model checker


Abstract


The Domain Name System (DNS) is an essential protocol 
used by both legitimate Internet applications and cyber at- 
tacks. For example, botnets rely on DNS to support agile com- 
mand and control infrastructures. An effective way to disrupt 
these attacks is to place malicious domains on a “blocklist” 
(or “blacklist”) or to add a filtering rule in a firewall or network
intrusion detection system. To evade such security countermeasures,
attackers have used DNS agility, e.g., by using
new domains daily to evade static blacklists and firewalls. In 
this paper we propose Notos, a dynamic reputation system for 
DNS. The premise of this system is that malicious, agile use 
of DNS has unique characteristics and can be distinguished 
from legitimate, professionally provisioned DNS services. No- 
tos uses passive DNS query data and analyzes the network 
and zone features of domains. It builds models of known legit- 
imate domains and malicious domains, and uses these models 
to compute a reputation score for a new domain indicative of 
whether the domain is malicious or legitimate. We have eval- 
uated Notos in a large ISP’s network with DNS traffic from 
1.4 million users. Our results show that Notos can identify 
malicious domains with high accuracy (true positive rate of 
96.8%) and low false positive rate (0.38%), and can identify 
these domains weeks or even months before they appear in 
public blacklists. 


1 Introduction 


The Domain Name System (DNS) [12, 13] maps domain 
names to IP addresses, and provides a core service to applica- 
tions on the Internet. DNS is also used in network security to 
distribute IP reputation information, e.g., in the form of DNS- 
based Block Lists (DNSBLs) used to filter spam [18, 5] or 
block malicious web pages [26, 14]. 

Internet-scale attacks often use DNS as well because they 
are essentially Internet-scale malicious applications. For ex- 
ample, spyware uses anonymously registered domains to ex- 
filtrate private information to drop sites. Disposable domains 
are used by adware to host malicious or false advertising 
content. Botnets make agile use of short-lived domains to
evasively move their command-and-control (C&C) infrastructure.
Fast-flux networks rapidly change DNS records to evade
blacklists and resist take downs [25]. In an attempt to evade 
domain name blacklisting, attackers now make very aggres- 
sive use of DNS agility. The most common example of an ag- 
ile malicious resource is a fast-flux network, but DNS agility 
takes many other forms including disposable domains (e.g., 
tens of thousands of randomly generated domain names used 
for spam or botnet C&C), domains with dozens of A records or 
NS records (in excess of levels recommended by RFCs, in or- 
der to resist takedowns), or domains used for only a few hours 
of a botnet’s lifetime. Perhaps the best example is the Con- 
ficker.C worm [15]. After Conficker.C infects a machine, it 
will try to contact its C&C server, chosen at random from a list 
of 50,000 possible domain names created every day. Clearly, 
the goal of Conficker.C was to frustrate blacklist maintenance 
and takedown efforts. Other malware that abuse DNS include 
Sinowal (a.k.a. Torpig) [9], Kraken [20], and Srizbi [22]. The 
aggressive use of newly registered domain names is seen in 
other contexts, such as spam campaigns and malicious flux 
networks [25, 19]. This strategy delays takedowns, degrades 
the effectiveness of blacklists, and pollutes the Internet’s name 
space with unwanted, discarded domains. 


In this paper, we study the problem of dynamically assign- 
ing reputation scores to new, unknown domains. Our main 
goal is to automatically assign a low reputation score to a 
domain that is involved in malicious activities, such as mal- 
ware spreading, phishing, and spam campaigns. Conversely, 
we want to assign a high reputation score to domains that are 
used for legitimate purposes. The reputation scores enable dy- 
namic domain name blacklists to counter cyber attacks much 
more effectively. For example, with static blacklisting, by the 
time one has sufficient evidence to put a domain on a black- 
list, it typically has been involved in malicious activities for 
a significant period of time. With dynamic blacklisting our 
goal is to decide, even for a new domain, whether it is likely 
used for malicious purposes. To this end, we propose Notos, 
a system that dynamically assigns reputation scores to domain 
names. Our work is based on the observation that agile mali- 
cious uses of DNS have unique characteristics, and can be dis- 
tinguished from legitimate, professionally provisioned DNS 
services. In short, network resources used for malicious and
fraudulent activities inevitably have distinct network characteristics
because of their need to evade security countermeasures.
By identifying and measuring these features, Notos can
assign appropriate reputation scores. 

Notos uses historical DNS information collected passively 
from multiple recursive DNS resolvers distributed across the 
Internet to build a model of how network resources are al- 
located and operated for legitimate, professionally run Inter- 
net services. Notos also uses information about malicious do- 
main names and IP addresses obtained from sources such as 
spam-traps, honeynets, and malware analysis services to build 
a model of how network resources are typically allocated by 
Internet miscreants. With these models, Notos can assign rep- 
utation scores to new, previously unseen domain names, there- 
fore enabling dynamic blacklisting of unknown malicious do- 
main names and IP addresses. 

Previous work on dynamic reputation systems mainly fo- 
cused on IP reputation [24, 31, 1, 21]. To the best of our 
knowledge, our system is the first to create a comprehensive 
dynamic reputation system around domain names. To summa- 
rize, our main contributions are as follows: 


• We designed Notos, a dynamic, comprehensive reputation
system for DNS that outputs reputation scores for
domains. We constructed network and zone features that 
capture the characteristics of resource provisioning, us- 
ages, and management of domains. These features enable 
Notos to learn models of how legitimate and malicious 
domains are operated, and compute accurate reputation 
scores for new domains. 


• We implemented a proof-of-concept version of our system,
and deployed it in a large ISP’s DNS network in
Atlanta, GA and San Jose, CA, USA, where we observed
DNS traffic from 1.4 million users. We also used
passive DNS data from the Security Information Exchange
(SIE) project [3]. This extensive real-world evaluation
shows Notos can correctly classify new domains with 
a low false positive rate (0.38%) and high true positive 
rate (96.8%). Notos can detect and assign a low reputa- 
tion score to malware- and spam-related domain names 
several days or even weeks before they appear on public 
blacklists. 


Section 2 provides some background on DNS and related 
works. Readers familiar with this may skip to Section 3, where 
we describe our passive DNS collection strategy and other 
whitelist and blacklist inputs. We also describe three fea- 
ture extraction modules that measure key network, zone and 
evidence-based features. Finally, we describe how these fea- 
tures are clustered and incorporated into the final reputation 
engine. To evaluate the output of Notos, we gathered an ex- 
tensive amount of network trace data. Section 4 describes the 
data collection process, and Section 5 details the sensitivity of 
each module and final output.


2 Background and Related Work


DNS is the protocol that resolves a domain name, like 
www.example.com, to its corresponding IP address, for ex- 
ample 192.0.2.10. To resolve a domain, a host typically 
needs to consult a local recursive DNS server (RDNS). A re- 
cursive server iteratively discovers which Authoritative Name 
Server (ANS) is responsible for each zone. The typical result 
of this iterative process is the mapping between the requested 
domain name and its current IP addresses. 

By aggregating all unique, successfully resolved A-type 
DNS answers at the recursive level, one can build a passive 
DNS database. This passive DNS (pDNS) database is ef- 
fectively the DNS fingerprint of the monitored network and 
typically contains unique A-type resource records (RRs) 
that were part of monitored DNS answers. A typical RR 
for the domain name example.com has the following for- 
mat: {example.com. 78366 IN A 192.0.2.10}, 
which lists the domain name, TTL, class, type, and rdata. For 
simplicity, we will refer to an RR in this paper as just a tuple 
of the domain name and IP address. 
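
As an illustration of this reduction, the following Python sketch parses textual A-type RRs in the format shown above into (domain name, IP address) tuples and collects the unique tuples in a set. The function name `rr_to_tuple` and the sample answers are hypothetical; a real passive DNS collector would parse DNS wire-format responses rather than text.

```python
def rr_to_tuple(rr_text):
    """Reduce a textual A-type RR to the (domain name, IP address) tuple."""
    name, ttl, rr_class, rr_type, rdata = rr_text.strip("{} ").split()
    if rr_class != "IN" or rr_type != "A":
        raise ValueError("only IN A records are handled in this sketch")
    # The TTL is discarded; the trailing root dot is dropped from the name.
    return name.rstrip(".").lower(), rdata


# A passive DNS database modeled as a set of unique (domain, IP) tuples.
answers = [
    "{example.com. 78366 IN A 192.0.2.10}",
    "{example.com. 120 IN A 192.0.2.10}",      # same mapping, different TTL
    "{www.example.com. 300 IN A 192.0.2.11}",
]
pdns = {rr_to_tuple(a) for a in answers}
print(sorted(pdns))
```

Because the TTL is dropped, repeated answers for the same mapping collapse into a single tuple.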

Passive DNS data collection was first proposed by Florian 
Weimer [27]. His system was among the first that appeared 
in the DNS community with its primary purpose being the 
conversion of historic DNS traffic into an easily accessible 
format. Zdrnja et al. [29] with their work in “Passive Mon- 
itoring of DNS Anomalies” discuss how pDNS data can be 
used for gathering security information from domain names. 
Although they acknowledge the possibility of creating a DNS 
reputation system based on passive DNS measurement, they 
do not quantify a reputation function. Our work uses the idea 
of building passive DNS information only as a seed for com- 
puting statistical DNS properties for each successful DNS res- 
olution. The analysis of these statistical properties is the basic 
building block for our dynamic domain name reputation func- 
tion. Plonka et al. [17] introduced Treetop, a scalable way to 
manage a growing collection of passive DNS data and at the 
same time correlate zone and network properties. Their clus- 
ter zones are based on different classes of networks (class A, 
class B and class C). Treetop differentiates DNS traffic based 
on whether it complies with various DNS RFCs and based on 
the resolution result. Plonka’s proposed method, despite being 
novel and highly efficient, offers limited DNS security infor- 
mation and cannot assign reputation scores to records. 

Several papers, e.g., Sinha et al. [24], have studied the effectiveness
of IP blacklists. Zhang et al. [31] showed that the hit
rate of highly predictable blacklists (HBLs) decreases significantly
over a period of time. Our work addresses the dynamic
DNS blacklisting problem, which makes it significantly different
from the highly predictable blacklists. Importantly, Notos
does not aim to create IP blacklists. By using properties of the 
DNS protocol, Notos can rank a domain name as potentially 
malicious or not. Garera et al. [8] discussed “phishing” detection
predominantly using properties of the URL and not statistical
observations about the domains or the IP address. The
statistical features used by Holz et al. [10] to detect fast flux 
networks are similar to the ones we used in our work, however, 
Notos utilizes a more complete collection of network statisti- 
cal features and is not limited to fast flux networks detection. 

Researchers have attempted to use unique characteristics 
of malicious networks to detect sources of malicious activity. 
Anderson et al. [1] proposed Spamscatter as the first system to 
identify and characterize spamming infrastructure by utilizing 
layer 7 analysis (i.e., web sites and images in spam). Hao et
al. [21] proposed SNARE, a spatio-temporal reputation engine 
for detecting spam messages with very high accuracy and low 
false positive rates. The SNARE reputation engine is the first 
work that utilized statistical network-based features to harvest 
information for spam detection. Notos is complementary to 
SNARE and Spamscatter, and extends both to not only de- 
tect spam, but also identify other malicious activity such as 
phishing and malware hosting. Qian et al. [28] present their 
work on spam detection using network-based clustering. In 
this work, they show that network-based clusters can increase 
the accuracy of spam-oriented blacklists. Our work is more 
general, since we try to identify various kinds of malicious 
domain names. Nevertheless, both works leverage network- 
based clustering for identifying malicious activities. 

Felegyhazi et al. [7] proposed a DNS reputation blacklisting
methodology based on WHOIS observations. Our system
does not use WHOIS information, making our approaches
complementary by design. Sato et al. [23] proposed a way to
extend current blacklists by observing the co-occurrence of IP
address information. Notos is a more generic approach than 
the proposed system by Sato and is not limited to botnet re- 
lated domain name detection. Finally, Notos builds the rep- 
utation function mainly based upon passive information from 
DNS traffic observed in real networks — not traffic observed 
from honeypots. 

No previous work has tried to assign a dynamic domain 
name reputation score for any domain that traverses the edge 
of a network. Notos harvests information from multiple 
sources—the domain name, its effective zone, the IP address, 
the network the IP address belongs to, the Autonomous Sys- 
tem (AS) and honeypot analysis. Furthermore, Notos uses 
short-lived passive DNS information. Thus, it is difficult for a 
malicious domain to dilute its passive DNS footprint. 


3 Notos: A Dynamic Reputation System 


The goal of the Notos reputation system is to dynamically 
assign reputation scores to domain names. Given a domain 
name d, we want to assign a low reputation score if d is in- 
volved in malicious activities (e.g., if it has been involved with 
botnet C&C servers, spam campaigns, malware propagation, 
etc.). On the other hand, we want to assign a high reputation 
score if d is associated with legitimate Internet services.

Notos’ main source of information is a passive DNS
(pDNS) database, which contains historical information about 
domain names and their resolved IPs. Our pDNS database is 
constantly updated using real-world DNS traffic from multiple 
geographically diverse locations as shown in Figure 1. We col- 
lect DNS traffic from two ISP recursive DNS servers (RDNS) 
located in Atlanta and San Jose. The ISP nodes witness 30,000 
DNS queries/second during peak hours. We also collect DNS 
traffic through the Security Information Exchange (SIE) [3], 
which aggregates DNS traffic received by a large number of 
RDNS servers from authoritative name servers across North 
America and Europe. In total, the SIE project processes ap- 
proximately 200 Mbit/s of DNS messages, several times the 
total volume of DNS traffic in a single US ISP. 

Another source of information we use is a list of known 
malicious domains. For example, we run known malware 
samples in a controlled environment and we classify as sus- 
picious all the domains contacted by malware samples that do 
not match a pre-compiled white list. In addition, we extract 
suspicious domain names from spam emails collected using a 
large spam-trap. Again, we discard the domains that match 
our whitelist and consider the rest as potentially malicious. 
Furthermore, we collect a large list of popular, legitimate
domains from alexa.com (we discuss our data collection and
analysis in more detail in Section 4). The set of known
malicious and legitimate domains represents our knowledge base,
and is used to train our reputation engine, as we discuss in
Section 4.

Intuitively, a domain name d can be considered suspicious 
when there is evidence that d or its IP addresses are (or were in 
previous months) associated with known malicious activities. 
The more evidence of “bad associations” we can find about 
d, the lower the reputation score we will assign to it. On the 
other hand, if there is evidence that d is (or was in the past) as- 
sociated with legitimate, professionally run Internet services, 
we will assign it a higher reputation score. 


3.1 System Overview 


Before describing the internals of our reputation sys- 
tem, we introduce some basic terminology. A domain 
name d consists of a set of substrings or labels sepa- 
rated by a period; the rightmost label is called the top- 
level domain, or TLD. The second-level domain (2LD) 
represents the two rightmost labels separated by a pe- 
riod; the third-level domain (3LD) analogously contains the 
three rightmost labels, and so on. As an example, given
the domain name d=“a.b.example.com”, TLD(d)=“com”,
2LD(d)=“example.com”, and 3LD(d)=“b.example.com”.

Let s be a domain name (e.g., s=“example.com”). We
define Zone(s) as the set of domains that include s and all
domain names that end with a period followed by s (e.g.,
domains ending in “.example.com”).
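
The label terminology above can be sketched in a few lines of Python. This is only an illustration, not part of Notos: the helper names `nld` and `in_zone` are hypothetical, and the sketch splits on periods only, so it ignores multi-label public suffixes such as “co.uk”.

```python
def nld(domain: str, n: int) -> str:
    """Return the n rightmost labels of a domain name (n=1 gives the TLD)."""
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-n:])


def in_zone(domain: str, s: str) -> bool:
    """True if domain is in Zone(s): it equals s or ends with '.' + s."""
    domain, s = domain.lower().rstrip("."), s.lower().rstrip(".")
    return domain == s or domain.endswith("." + s)


d = "a.b.example.com"
tld, second, third = nld(d, 1), nld(d, 2), nld(d, 3)
# tld == "com", second == "example.com", third == "b.example.com"
```

Matching on the “.” + s suffix, rather than on the bare string s, is what keeps a name such as “notexample.com” out of Zone(“example.com”).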

Let $D = \{d_1, d_2, \ldots, d_m\}$ be a set of domain names. We
call A(D) the set of IP addresses ever pointed to by any domain
name $d \in D$.

Figure 1. System overview.

Given an IP address a, we define BGP(a) to be the set
of all IPs within the BGP prefix of a, and AS(a) as the set
of IPs located in the autonomous system in which a resides.
In addition, we can extend these functions to take as input
a set of IPs: given the IP set $A = \{a_1, a_2, \ldots, a_N\}$,
$BGP(A) = \bigcup_{k=1}^{N} BGP(a_k)$; AS(A) is similarly extended.

To assign a reputation score to a domain name d we proceed
as follows. First, we consider the most current set
$A_c(d) = \{a_i\}_{i=1..m}$ of IP addresses to which d points. Then,
we query our pDNS database to retrieve the following information:


• Related Historic IPs (RHIPs), which consist of the union
of A(d), A(Zone(3LD(d))), and A(Zone(2LD(d))).
To simplify the notation, we will refer to
A(Zone(3LD(d))) and A(Zone(2LD(d))) as $A_{3LD}(d)$
and $A_{2LD}(d)$, respectively.


• Related Historic Domains (RHDNs), which comprise the
entire set of domain names that ever resolved to an IP
address $a \in AS(A(d))$. In other words, RHDNs contain
all the domains $d_i$ for which $A(d_i) \cap AS(A(d)) \neq \emptyset$.
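
The two lookups above can be illustrated on a toy pDNS table. This is a minimal sketch under stated assumptions: the records, the `IP_TO_AS` mapping, and all function names are made up for illustration, and Zone(s) is restricted to domains that actually appear in the table.

```python
# Toy pDNS records (domain, resolved IP); all values are made up.
PDNS = [
    ("www.example.com", "192.0.2.10"),
    ("mail.example.com", "192.0.2.11"),
    ("example.com", "192.0.2.10"),
    ("cdn.example.net", "198.51.100.7"),
    ("x7k.example.org", "192.0.2.11"),
]
# Hypothetical IP -> AS number mapping (a real system would use BGP data).
IP_TO_AS = {"192.0.2.10": 64500, "192.0.2.11": 64500, "198.51.100.7": 64501}


def nld(domain, n):
    """The n rightmost labels of a domain name."""
    return ".".join(domain.split(".")[-n:])


def in_zone(domain, s):
    return domain == s or domain.endswith("." + s)


def A(domains):
    """A(D): every IP ever pointed to by any domain in D."""
    domains = set(domains)
    return {ip for dom, ip in PDNS if dom in domains}


def zone(s):
    """Zone(s), restricted to the domains seen in the pDNS table."""
    return {dom for dom, _ in PDNS if in_zone(dom, s)}


def rhips(d):
    """Union of A(d), A(Zone(3LD(d))) and A(Zone(2LD(d)))."""
    return A({d}) | A(zone(nld(d, 3))) | A(zone(nld(d, 2)))


def rhdns(d):
    """Domains with at least one IP in the same AS as some IP in A(d)."""
    ases = {IP_TO_AS[ip] for ip in A({d})}
    return {dom for dom, ip in PDNS if IP_TO_AS[ip] in ases}


related_ips = rhips("www.example.com")
related_domains = rhdns("www.example.com")
```

In this toy table, “x7k.example.org” ends up in the RHDNs of “www.example.com” only because it once resolved to an IP in the same autonomous system, which is the kind of association the zone-based features later summarize.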


After extracting the above information from our pDNS 
database, we measure a number of statistical features. Specif- 
ically, for each domain d we extract three groups of features, 
as shown in Figure 2: 


• Network-based features: The first group of statistical
features is extracted from the set of RHIPs. We measure
quantities such as the total number of IPs historically
associated with d, the diversity of their geographical
location, the number of distinct autonomous systems (ASs)
in which they reside, etc.


• Zone-based features: The second group of features is
extracted from the RHDNs set. We measure the
average length of domain names in RHDNs, the number
of distinct TLDs, the occurrence frequency of different
characters, etc.


• Evidence-based features: The last set of features
includes the measurement of quantities such as the number
of distinct malware samples that contacted the domain d,
the number of malware samples that connected to any of
the IPs pointed to by d, etc.


Once extracted, these statistical features are fed to the 
reputation engine. Notos’ reputation engine operates in two 
modes: an off-line “training” mode and an on-line “classifica- 
tion” mode. During the off-line mode, Notos trains the repu- 
tation engine using the information gathered in our knowledge 
base, namely the set of known malicious and legitimate do- 
main names and their related IP addresses. Afterwards, during 
the on-line mode, for each new domain d, Notos queries the 
trained reputation engine to compute a reputation score for d 
(see Figure 3). We now explain the details about the statistical
features we measure, and how the reputation engine uses them
during the off-line and on-line modes to compute a domain
name’s reputation score.


3.2 Statistical Features 


In this section we identify key statistical features and the 
intuition behind their selection. 


3.2.1 Network-based Features 


Figure 3. Off-line and on-line modes in Notos.

Given a domain d we extract a number of statistical features
from the set RHIPs of d, as mentioned in Section 3.1. Our
network-based features describe how the operators who own d
and the IPs that domain d points to allocate their network
resources. Internet miscreants often abuse DNS to operate their
malicious networks with a high level of agility. Namely, the
domain names and IPs that are used for malicious purposes
are often short-lived and are characterized by a high churn
rate. This agility avoids simple blacklisting or removals by
law enforcement. In order to measure the level of agility of
a domain name d, we extract eighteen statistical features that
describe d’s network profile. Our network features fall into the
following three groups:


• BGP features. This subset consists of a total of nine
features. We measure the number of distinct BGP prefixes
related to BGP(A(d)), the number of countries in which
these BGP prefixes reside, and the number of organizations
that own these BGP prefixes; the number of distinct
IP addresses in the sets $A_{3LD}(d)$ and $A_{2LD}(d)$; the
number of distinct BGP prefixes related to $BGP(A_{3LD}(d))$
and $BGP(A_{2LD}(d))$, and the number of countries in
which these two sets of prefixes reside.


• AS features. This subset consists of three features,
namely the number of distinct autonomous systems related
to AS(A(d)), $AS(A_{3LD}(d))$, and $AS(A_{2LD}(d))$.


• Registration features. This subset consists of six features.
We measure the number of distinct registrars associated
with the IPs in the A(d) set; the diversity in the
registration dates related to the IPs in A(d); the number of
distinct registrars associated with the IPs in the $A_{3LD}(d)$
and $A_{2LD}(d)$ sets; and the diversity in the registration
dates for the IPs in $A_{3LD}(d)$ and $A_{2LD}(d)$.
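
Most of the network features above are counts of distinct values over a set of IPs. The sketch below shows that counting pattern for prefixes, ASs, and countries. It is an illustration only: the `IP_INFO` table and the function name `count_features` are hypothetical, and a real deployment would obtain this metadata from BGP and registry data.

```python
# Hypothetical routing metadata per IP: (BGP prefix, AS number, country).
IP_INFO = {
    "192.0.2.10": ("192.0.2.0/24", 64500, "US"),
    "192.0.2.11": ("192.0.2.0/24", 64500, "US"),
    "198.51.100.7": ("198.51.100.0/24", 64501, "DE"),
    "203.0.113.5": ("203.0.113.0/24", 64502, "BR"),
}


def count_features(ips):
    """Distinct BGP prefixes, ASs and countries covered by a set of IPs."""
    info = [IP_INFO[ip] for ip in ips]
    return {
        "bgp_prefixes": len({prefix for prefix, _, _ in info}),
        "ases": len({asn for _, asn, _ in info}),
        "countries": len({country for _, _, country in info}),
    }


# A "stable" profile: both IPs sit in one prefix, one AS, one country.
stable = count_features({"192.0.2.10", "192.0.2.11"})
# An "agile" profile: every IP sits in a different network.
agile = count_features({"192.0.2.10", "198.51.100.7", "203.0.113.5"})
```

Applying the same function to A(d), $A_{3LD}(d)$, and $A_{2LD}(d)$ would yield the per-set counts described in the BGP and AS feature groups.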


While most legitimate, professionally run Internet services
have a very stable network profile, which is reflected in low
values of the network features described above, the profiles of
malicious networks (e.g., fast-flux networks) usually change
relatively frequently, thus causing their network features to be
assigned higher values. We expect a domain name d from a
legitimate zone to exhibit small values in its AS features,
mainly because the IPs in the RHIPs should belong to the
same organization or a small number of different organizations.
On the other hand, if a domain name d participates in
malicious activities (e.g., botnet activities, flux networks), then
it could reside in a large number of different networks. The list
of IPs in the RHIPs that correspond to the malicious domain
name will produce AS features with higher values. In the same
sense, we measure the homogeneity of the registration
information for benign domains. Legitimate domains are typically
linked to address space owned by organizations that acquire
and announce network blocks in some order. This means that
a legitimate domain name d whose address space is owned by
the same organization will produce a list of IPs in the RHIPs
with small registration-feature values. If this set of IPs
exhibits high registration-feature values, it means that they
were very likely registered through different registrars and
on different dates. Such registration-feature
properties are typically linked with fraudulent domains.

Figure 4. (a) Network profile modeling in Notos.
(b) Network and zone based clustering in Notos.


3.2.2 Zone-based Features 


The network-based features measure a number of characteris- 
tics of IP addresses historically related to a given domain name 
d. On the other hand, the zone-based features measure the 
characteristics of domain names historically associated with 
d. The intuition behind the zone-based features is that while 
legitimate Internet services may be associated with many dif- 
ferent domain names, these domain names usually have strong 
similarities. For example, google.com, googlesyndication.com,
googlewave.com, etc., are all related to
Internet services provided by Google, and contain the string 
“google” in their name. On the other hand, malicious domain 
names related to the same spam campaign, for example, often
look randomly generated and share few common characteristics.
Therefore, our zone-based features aim to measure the
level of diversity across the domain names in the RHDNs set.
Given a domain name d, we extract seventeen statistical
features that describe the properties of the set RHDNs of domain
names related to d. We divide these seventeen features into
two groups:


• String features. This group consists of twelve features.
We measure the number of distinct domain names in
RHDNs, and the average and standard deviation of their
length; the mean, median, and standard deviation of the
occurrence frequency of each single character in the
domain name strings in RHDNs; the mean, median, and
standard deviation of the distribution of 2-grams (i.e.,
pairs of characters); and the mean, median, and standard
deviation of the distribution of 3-grams.


• TLD features. This group consists of five features. For
each domain $d_i$ in the RHDNs set, we extract its top-level
domain $TLD(d_i)$ and we count the number of distinct
TLD strings that we obtain; we measure the ratio between
the number of domains $d_i$ whose $TLD(d_i)$=“.com” and
the total number of TLDs different from “.com”; also, we
measure the mean, median, and standard deviation of the
occurrence frequency of the TLD strings.


It is worth noting that whenever we measure the mean, me- 
dian and standard deviation of a certain property, we do so in 
order to summarize the shape of its distribution. For example,
by measuring the mean, median, and standard deviation
of the occurrence frequency of each character in a set of
domain name strings, we summarize what the distribution of the
character frequency looks like.
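
As a rough illustration of the twelve string features, the sketch below summarizes name lengths and the frequency distributions of 1-, 2-, and 3-grams. Several details are assumptions rather than statements from the text: periods are counted as characters, frequencies are relative to the total number of n-grams, the population standard deviation is used, and the names `summarize`, `ngram_frequencies`, and `string_features` are hypothetical.

```python
from collections import Counter
from statistics import mean, median, pstdev


def summarize(values):
    """Mean, median and (population) standard deviation of a sample."""
    return mean(values), median(values), pstdev(values)


def ngram_frequencies(names, n):
    """Relative occurrence frequency of each distinct n-gram over all names."""
    counts = Counter(name[i:i + n] for name in names
                     for i in range(len(name) - n + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def string_features(rhdns):
    """Twelve values: 3 about name counts/lengths, 3 each for 1/2/3-grams."""
    lengths = [len(name) for name in rhdns]
    feats = [len(set(rhdns)), mean(lengths), pstdev(lengths)]
    for n in (1, 2, 3):
        feats.extend(summarize(ngram_frequencies(rhdns, n)))
    return feats


features = string_features(["ab.com", "cd.com"])
```

Summarizing each distribution by three numbers keeps the feature vector at a fixed length regardless of how many domain names the RHDNs set contains.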


3.2.3 Evidence-based Features


We use the evidence-based features to determine to what ex- 
tent a given domain d is associated with other known mali- 
cious domain names or IP addresses. As mentioned above, 
Notos collects a knowledge base of known suspicious, ma- 
licious, and legitimate domain names and IPs from public 
sources. For example, we collect malware-related domain 
names by executing large numbers of malware samples in a 
controlled environment. Also, we check IP addresses against 
a number of public IP blacklists. We elaborate on how we 
build Notos’ knowledge base in Section 4. Given a domain 
name d, we measure six statistical features using the informa- 
tion in the knowledge base. We divide these features into two 
groups: 


• Honeypot features. We measure three features, namely
the number of distinct malware samples that, when
executed, try to contact d or any IP address in A(d); the
number of malware samples that contact any IP address
in BGP(A(d)); and the number of samples that contact
any IP address in AS(A(d)).


• Blacklist features. We measure three features, namely the
number of IP addresses in A(d) that are listed in public
IP blacklists; the number of IPs in BGP(A(d)) that are
listed in IP blacklists; and the number of IPs in AS(A(d))
that are listed in IP blacklists.
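
The six evidence-based features can be sketched as set intersections. This is a minimal illustration, assuming the IP sets A(d), BGP(A(d)), and AS(A(d)) have already been computed; the function name, argument names, and all sample data are hypothetical.

```python
def evidence_features(d, a_d, bgp_ips, as_ips, malware_contacts, ip_blacklist):
    """Six evidence-based features for domain d.

    a_d, bgp_ips, as_ips: the IP sets A(d), BGP(A(d)) and AS(A(d)).
    malware_contacts: dict sample id -> set of domains/IPs it contacted.
    ip_blacklist: set of blacklisted IPs.
    """
    def samples_touching(targets):
        # Number of distinct samples that contacted at least one target.
        return sum(1 for contacted in malware_contacts.values()
                   if contacted & targets)

    honeypot = [
        samples_touching({d} | a_d),
        samples_touching(bgp_ips),
        samples_touching(as_ips),
    ]
    blacklist = [len(s & ip_blacklist) for s in (a_d, bgp_ips, as_ips)]
    return honeypot + blacklist


features = evidence_features(
    "bad.example.com",
    a_d={"192.0.2.10"},
    bgp_ips={"192.0.2.10", "192.0.2.11"},
    as_ips={"192.0.2.10", "192.0.2.11", "192.0.2.200"},
    malware_contacts={
        "sample-1": {"bad.example.com"},
        "sample-2": {"192.0.2.11"},
        "sample-3": {"198.51.100.1"},
    },
    ip_blacklist={"192.0.2.11", "192.0.2.200"},
)
```

In this made-up example the domain's own IP is not blacklisted, but its BGP prefix and AS contain blacklisted addresses, so the wider-scope features are nonzero.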


Notos uses the blacklist features from the evidence vector
so it can identify the re-use of known malicious network
resources like IPs, BGP prefixes or even ASs. Domain names
are significantly cheaper than IPv4 addresses, so malicious
users tend to reuse address space with new domain names. We
should note that the evidence-based features represent only
part of the information we use to compute the reputation
scores. The fact that a domain name was queried by malware
does not automatically mean that the domain will receive a
low reputation score.


3.3 Reputation Engine


Notos’ reputation engine is responsible for deciding 
whether a domain name d has characteristics that are simi- 
lar to either legitimate or malicious domain names. In order 
to achieve this goal, we first need to train the engine to rec- 
ognize whether d belongs (or is “close”) to a known class of 
domains. This training can be repeated periodically, in an off- 
line fashion, using historical information collected in Notos’ 
knowledge base (see Section 4). Once the engine has been 
trained, it can be used in on-line mode to assign a reputation 
score to each new domain name d. 

In this section, we first explain how the reputation engine 
is trained, and then we explain how a trained engine is used to 
assign reputation scores. 


3.3.1 Off-Line Training Mode 


During off-line training (Figure 3), the reputation engine 
builds three different modules. We briefly introduce each 
module and then elaborate on the details. 


• Network Profiles Model: a model of how well-known
networks behave. For example, we model the network
characteristics of popular content delivery networks (e.g.,
Akamai, Amazon CloudFront), and large popular websites
(e.g., google.com, yahoo.com). During the on-line
mode, we compare each new domain name d to these
models of well-known network profiles, and use this
information to compute the final reputation score, as
explained below.


• Domain Name Clusters: we group domain names into
clusters sharing similar characteristics. We create these
clusters of domains to identify groups of domains that
contain mostly malicious domains, and groups that
contain mostly legitimate domains. In the on-line mode,
given a new domain d, if d (more precisely, d’s projection
into a statistical feature space) falls within (or close
to) a cluster of domains containing mostly malicious
domains, for example, this gives us a hint that d should be
assigned a low reputation score.


• Reputation Function: for each domain name $d_i$, $i = 1..n$,
in Notos’ knowledge base, we test it against the trained
network profiles model and domain name clusters. Let
$NM(d_i)$ and $DC(d_i)$ be the output of the Network Profiles
(NP) module and the Domain Clusters (DC) module,
respectively. The reputation function takes as input
$NM(d_i)$, $DC(d_i)$, and information about whether $d_i$ and
its resolved IPs $A(d_i)$ are known to be legitimate,
suspicious, or malicious (i.e., if they appeared in a domain
name or IP blacklist), and builds a model that can assign
a reputation score between zero and one to d. A reputation
score close to zero signifies that d is a malicious
domain name while a score close to one signifies that d
is benign.


We now describe each module in detail. 


3.3.2 Modeling Network Profiles 


During the off-line training mode, the reputation engine builds 
a model of well-known network behaviors. An overview of the 
network profile modeling module can be seen in Figure 4(a). 
In practice we select five sets of domain names that share
similar characteristics, and learn their network profiles. For
example, we identify a set of domain names related to very popular
websites (e.g., google.com, yahoo.com, amazon.com) and for 
each of the related domain names we extract their network fea- 
tures, as explained in Section 3.2.1. We then use the extracted 
feature vectors to train a statistical classifier that will be able 
to recognize whether a new domain name d has network char- 
acteristics similar to the popular websites we modeled. 

In our current implementation of Notos we model the fol- 
lowing classes of domain names: 


• Popular Domains. This class consists of a large
set of domain names under the following DNS
zones: google.com, yahoo.com, amazon.com, ebay.com,
msn.com, live.com, myspace.com, and facebook.com.


• Common Domains. This class of domains includes
domain names under the top one hundred zones, according
to alexa.com. We exclude from this group all the
domain names already included in the Popular Domains
class (which we model separately).


• Akamai Domains. Akamai is a large content delivery
network (CDN), and the domain names related to
this CDN have very peculiar network characteristics. To
model the network profile of Akamai’s domain names,
we collect a set of domains under the following zones:
akafms.net, akamai.net, akamaiedge.net, akamai.com,
akadns.net, and akamai.com.


• CDN Domains. In this class we include domain
names related to CDNs other than Akamai. For example,
we collect domain names under the following
zones: panthercdn.com, llnwd.net, cloudfront.net,
nyud.net, nyucd.net and redcondor.net. We chose not
to aggregate these CDN domains and Akamai’s domains
in one class, since we observed that Akamai’s domains
have a very unique network profile, as we discuss in
Section 4. Therefore, learning two separate models for the
classes of Akamai Domains and CDN Domains allows
us to achieve better classification accuracy during the
on-line mode, compared to learning only one model for
both classes (see Section 3.3.5).


• Dynamic DNS Domains. This class includes a large set
of domain names registered under two of the largest
dynamic DNS providers, namely No-IP (no-ip.com) and
DynDNS (dyndns.com).


For each class of domains, we train a statistical classifier 
to distinguish between one of the classes and all the others. 
Therefore, we train five different classifiers. For example, 
we train a classifier that can distinguish between the class of 
Popular Domains and all other classes of domains. That is, 
given a new domain name d, this classifier is able to recog- 
nize whether d’s network profile looks like the profile of a 
well-known popular domain or not. Following the same logic,
we can recognize network profiles for the other classes of
domains.
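
The one-versus-all arrangement can be sketched as follows. The text does not say which statistical classifier is used at this point, so the sketch swaps in a deliberately simple nearest-centroid rule as a stand-in; the function names, class names, and feature values are all hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_one_vs_rest(samples):
    """samples: list of (feature_vector, class_name), with >= 2 classes.

    Returns one binary model per class: (centroid of the class,
    centroid of all other classes).
    """
    classes = sorted({label for _, label in samples})
    models = {}
    for cls in classes:
        pos = [x for x, label in samples if label == cls]
        neg = [x for x, label in samples if label != cls]
        models[cls] = (centroid(pos), centroid(neg))
    return models


def network_profile_vector(models, x):
    """One score in [0, 1] per class; higher means closer to that class."""
    scores = {}
    for cls, (pos_c, neg_c) in models.items():
        d_pos, d_neg = math.dist(x, pos_c), math.dist(x, neg_c)
        scores[cls] = 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)
    return scores


# Made-up two-dimensional "network feature" vectors for three classes.
training = [
    ([1, 1], "popular"), ([1, 2], "popular"),
    ([10, 10], "cdn"), ([11, 10], "cdn"),
    ([5, 20], "dyndns"),
]
models = train_one_vs_rest(training)
scores = network_profile_vector(models, [1, 1.5])
```

With five classes, the same procedure would produce five scores per domain, which corresponds to the five-element output of the network profiles module used later in the on-line mode.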


3.3.3 Building Domain Name Clusters


In this phase, the reputation engine takes the domain names 
collected in our pDNS database during a training period, and 
builds clusters of domains that share similar network and zone 
based features. The overview of this module can be seen 
in Figure 4(b). We perform clustering in two steps. In the 
first step we only use the network-based features to create 
coarse-grained clusters. Then, in the second step, we split 
each coarse-grained cluster into finer clusters using only the 
zone-based features, as shown in Figure 5. 
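
The two-step procedure can be sketched as nested clustering. The clustering algorithm is not named in this section, so the sketch uses a simple greedy threshold ("leader") clustering as a stand-in; the radii, feature values, and function names are assumptions made for illustration only.

```python
import math


def leader_clusters(items, key, radius):
    """Greedy threshold clustering: join the first cluster whose leader
    is within `radius`, otherwise start a new cluster."""
    clusters = []  # list of (leader_vector, members)
    for item in items:
        vec = key(item)
        for leader, members in clusters:
            if math.dist(vec, leader) <= radius:
                members.append(item)
                break
        else:
            clusters.append((vec, [item]))
    return [members for _, members in clusters]


def two_step_clustering(domains, net_radius, zone_radius):
    """domains: list of dicts with 'name', 'net' and 'zone' vectors."""
    fine = []
    # Step 1: coarse clusters from network-based features only.
    for coarse in leader_clusters(domains, lambda d: d["net"], net_radius):
        # Step 2: split each coarse cluster using zone-based features only.
        fine.extend(leader_clusters(coarse, lambda d: d["zone"], zone_radius))
    return [[d["name"] for d in cluster] for cluster in fine]


# Made-up feature vectors: the first three names look equally "agile"
# at the network level, but only the first two have similar zone features.
domains = [
    {"name": "e55.g.akamaiedge.net", "net": [9, 9], "zone": [1, 1]},
    {"name": "e68.g.akamaiedge.net", "net": [9, 8], "zone": [1, 2]},
    {"name": "spzr.in", "net": [8, 9], "zone": [9, 9]},
    {"name": "www.example.com", "net": [1, 1], "zone": [1, 1]},
]
clusters = two_step_clustering(domains, net_radius=2.0, zone_radius=2.0)
```

Running the second step inside each coarse cluster, rather than clustering on all features at once, is what lets a CDN-like name and a malicious name share a network cluster yet land in different final clusters.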


Network-based Clustering The objective of network-based
clustering is to group domains that share similar levels of
agility. This creates separate clusters of domains with
“stable” network characteristics and “non-stable” networks (like
CDNs and malicious flux networks).


Figure 5. Network & zone based clustering process in Notos,
in the case of an Akamai [A] and a malicious [B] domain name.

Zone-based Clustering After clustering the domain names
according to their network-based features, we further split the
network-based clusters of domain names into finer groups.
In this step, we group domain names that are in the same
network-based cluster and also share similar zone-based
features. To better understand how the zone-based clustering
works, consider the following examples of zone-based clusters:


Cluster 1: 


..., 72.247.176.81 e55.g.akamaiedge.net, 72.247.176.94 e68.g.akamaiedge.net,
72.247.176.146 e120.g.akamaiedge.net, 72.247.176.65 e39.na.akamaiedge.net,
72.247.176.242 e216.g.akamaiedge.net, 72.247.176.33 e7.g.akamaiedge.net,
72.247.176.156 e130.g.akamaiedge.net, 72.247.176.208 e182.g.akamaiedge.net,
72.247.176.198 e172.g.akamaiedge.net, 72.247.176.217 e191.g.akamaiedge.net,
72.247.176.200 e174.g.akamaiedge.net, 72.247.176.99 e73.g.akamaiedge.net,
72.247.176.103 e77.g.akamaiedge.net, 72.247.176.59 e33.c.akamaiedge.net,
72.247.176.68 e42.gb.akamaiedge.net, 72.247.176.237 e211.g.akamaiedge.net,
72.247.176.71 e45.g.akamaiedge.net, 72.247.176.239 e213.na.akamaiedge.net,
72.247.176.120 e94.g.akamaiedge.net, ...








Cluster 2: 


..., 90.156.145.198 spzr.in, 90.156.145.198 vwui.in, 90.156.145.198 x9e.ru,
90.156.145.50 v2802.vps.masterhost.ru, 90.156.145.167 www.inshaker.ru,
90.156.145.198 x7l.ru, 90.156.145.198 c3q.at, 90.156.145.198 Itkq.in,
90.156.145.198 x7d.ru, 90.156.145.198 zdlz.in, 90.156.145.159 www.designcollector.ru,
90.156.145.198 x7o.ru, 90.156.145.198 gq5Sc.ru, 90.156.145.159 designtwitters.com,
90.156.145.198 u5d.ru, 90.156.145.198 x9d.ru, 90.156.145.198 xb8.ru,
90.156.145.198 xg8.ru, 90.156.145.198 x8m.ru, 90.156.145.198 shopfilmworld.cn,
90.156.145.198 bigappletopworld.cn, 90.156.145.198 uppd.in, ...


Each element of the cluster is a domain name-IP address
pair. These two groups of domains belonged to the
same network cluster, but were separated into two different
clusters by the zone-based clustering phase. Cluster 1
contains domain names belonging to Akamai’s CDN, while the
domains in Cluster 2 are all related to malicious websites that
distribute malicious software. The two clusters of domains
share similar network characteristics, but have significantly
different zone-based features. For example, consider domain
names $d_1$=“e55.g.akamaiedge.net” from the first cluster, and
$d_2$=“spzr.in” from the second cluster. $d_1$ and $d_2$ were
placed in the same network-based cluster because
the sets of RHIPs (see Section 3.1) for $d_1$ and $d_2$ have similar
characteristics. In particular, the network agility properties of
$d_2$ make it look as if it were part of a large CDN. However,
when we consider the sets of RHDNs for $d_1$ and $d_2$, we
notice that the zone-based features of $d_1$ are much more
“stable” than the zone-based features of $d_2$. In other words, while
the RHDNs of $d_1$ share strong domain name similarities (e.g.,
they all share the substring “akamai”) and have low variance of
the string features (see Section 3.2.2), the strong zone agility
properties of $d_2$ affect the zone-based features measured on
$d_2$’s RHDNs and make $d_2$ look very different from $d_1$.

Figure 6. The output from the network profiling
module, the domain clustering module and the
evidence vector will assist the reputation function to
assign the reputation score to the domain d.

One of the main advantages of Notos is the reliable
assignment of low reputation scores to domain names
participating in “agile” malicious campaigns. Less agile malicious
campaigns, e.g., FakeAV campaigns, may use domain names
structured to resemble CDN-related domains. Such strategies
would not be beneficial for the FakeAV campaign, since
domains like virus-scan1.com, virus-scan2.com,
etc., can be trivially blocked by using simple regular
expressions [16]. In other words, the attackers need to introduce
more “agility” at both the network and domain name level in 
order to avoid simple domain name blacklisting. Notos would 
only require a few labeled domain names belonging to the ma- 
licious campaign for training purposes, and the reputation en- 
gine would then generalize to assign a low reputation score to 
the remaining (previously unknown) domain names that be- 
long to the same malicious campaign. 


3.3.4 Building the Reputation Function 


Once we build a model of well-known network profiles (see
Section 3.3.2) and the domain clusters (see Section 3.3.3), we
can build the reputation function. The reputation function will
assign a reputation score in the interval [0, 1] to domain names,
with 0 meaning low reputation (i.e., likely malicious) and 1
meaning high reputation (i.e., likely legitimate). We
implement our reputation function as a statistical classifier. In order
to train the reputation function, we consider all the domain
names $d_i$, $i = 1, \ldots, n$, in Notos’ knowledge base, and we feed
each domain $d_i$ to the network profiles module and to the
domain clusters module to compute two output vectors $NM(d_i)$
and $DC(d_i)$, respectively. We explain the details of how
$NM(d_i)$ and $DC(d_i)$ are computed later in Section 3.3.5. For
now it is sufficient to consider $NM(d_i)$ and $DC(d_i)$ as two
feature vectors. For each $d_i$ we also compute an evidence
features vector $EV(d_i)$, as described in Section 3.2.3. Let $v(d_i)$
be a feature vector that combines the $NM(d_i)$, $DC(d_i)$, and
$EV(d_i)$ feature vectors. We train the reputation function
using the labeled dataset $L = \{(v(d_i), y_i)\}_{i=1..n}$, where $y_i = 0$
if $d_i$ is a known malicious domain name, otherwise $y_i = 1$.
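
Assembling the labeled dataset L amounts to concatenating the three vectors for each knowledge-base domain and attaching the label. The sketch below shows only this bookkeeping: the callables `nm`, `dc`, and `ev` are placeholders for the modules described in the text, and the function name and sample domains are hypothetical.

```python
def build_training_set(knowledge_base, nm, dc, ev):
    """knowledge_base: list of (domain, is_malicious) pairs.
    nm, dc, ev: callables returning the NM, DC and EV vectors of a domain.
    Returns L = [(v(d_i), y_i)], with y_i = 0 for malicious, 1 otherwise.
    """
    dataset = []
    for domain, is_malicious in knowledge_base:
        v = list(nm(domain)) + list(dc(domain)) + list(ev(domain))
        dataset.append((v, 0 if is_malicious else 1))
    return dataset


kb = [("good.example.com", False), ("bad.example.net", True)]
dataset = build_training_set(
    kb,
    nm=lambda d: [0.0] * 5,  # placeholder for NM(d), five classifier outputs
    dc=lambda d: [0.0] * 5,  # placeholder for DC(d), five cluster statistics
    ev=lambda d: [0.0] * 6,  # placeholder for EV(d), six evidence features
)
```

Any standard supervised learner that accepts fixed-length vectors could then be trained on `dataset`; the text only states that the reputation function is a statistical classifier.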


3.3.5 On-Line Mode 


After training is complete, the reputation engine can be used
in on-line mode (Figure 3) to assign a reputation score to new
domain names. For example, given an input domain name
$d$, the reputation engine computes a score $S \in [0,1]$. Values
of $S$ close to zero mean that $d$ appears to be related to
malicious activities and therefore has a low reputation. On
the other hand, values of $S$ close to one signify that $d$ appears
to be associated with benign Internet services, and therefore
has a high reputation. The reputation score is computed
as follows. First, $d$ is fed into the network profiles module,
which consists of five statistical classifiers, as discussed in
Section 3.3.2. The output of the network profiles module is
a vector $NM(d) = \{c_1, c_2, \ldots, c_5\}$, where $c_1$ is the output of
the first classifier, and can be viewed as the probability that
$d$ belongs to the class of Popular Domains, $c_2$ is the probability
that $d$ belongs to the class of Common Domains, etc.
At the same time, $d$ is fed into the domain clusters module,
which computes a vector $DC(d) = \{l_1, l_2, \ldots, l_5\}$. The elements
$l_i$ of this vector are computed as follows. Given $d$, we
first extract its network-based features and identify the closest
network-based cluster to $d$, among the network-based clusters
computed by the domain clusters module during the off-line
mode (see Section 3.3.3). Then, we extract the zone-based
statistical features and identify the zone-based cluster closest
to $d$. Let this closest domain cluster be $C_d$. At this point, we
consider all the zone-based feature vectors $v_j \in C_d$, and we
select the subset of vectors $V_d \subseteq C_d$ for which the two following
conditions are verified: i) $dist(z_d, v_j) < R$, where $z_d$
is the zone-based feature vector for $d$, and $R$ is a predefined
radius; ii) $v_j \in KNN(z_d)$, where $KNN(z_d)$ is the set of $k$
nearest-neighbors of $z_d$.
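
The two selection conditions can be illustrated with a minimal Python sketch. It is not the Notos implementation: the function names, the use of plain tuples as feature vectors, and the toy values of the radius and of k are assumptions made only for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_neighbors(z_d, cluster, radius, k):
    """Return the vectors of `cluster` that satisfy both conditions:
    (ii) they are among the k nearest neighbors of z_d, and
    (i) their distance from z_d is strictly smaller than `radius`."""
    ranked = sorted(cluster, key=lambda v: euclidean(z_d, v))
    knn = ranked[:k]                                        # condition ii
    return [v for v in knn if euclidean(z_d, v) < radius]   # condition i

# Toy zone-based cluster (hypothetical two-dimensional vectors).
cluster = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (30.0, 40.0)]
V_d = select_neighbors((0.0, 0.0), cluster, radius=12.0, k=3)
```

Applying the k-nearest-neighbor cut first and the radius cut second gives the same set as the reverse order, since both conditions are simple membership tests on the same distances.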

The feature vectors in $V_d$ are related to domain names extracted
from Notos’ knowledge base. Therefore, we can assign
a label to each vector $v_j \in V_d$, according to the nature of the
domain name $d_j$ from which $v_j$ was computed. The domains in
Notos’ knowledge base belong to different classes. In particular,
we distinguish between eight different classes of domains,
namely Popular Domains, Common Domains, Akamai, CDN,
and Dynamic DNS, which have the same meaning as explained
in Section 3.3.2, and Spam Domains, Flux Domains, and Malware
Domains.

In order to compute the output vector $DC(d)$, we compute
the following five statistical features: the majority class label
$L$ (e.g., $L$ may be equal to Malware Domain), i.e., the label
that appears the most among the vectors $v_j \in V_d$; the standard
deviation of label frequencies, i.e., given the occurrence
frequency of each label among the vectors $v_j \in V_d$, we compute
their standard deviation; and, given the subset $V_d^{(L)} \subseteq V_d$ of
vectors in $V_d$ that are associated with label $L$, the mean,
median and standard deviation of the distribution
of distances between $z_d$ and the vectors $v_j \in V_d^{(L)}$.
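
As a rough illustration of these five features, the following Python sketch computes them for a small labeled neighborhood. Several details are assumptions not stated in the text: the population standard deviation is used, label frequencies are taken only over labels that actually occur in $V_d$, and the function name and toy data are hypothetical.

```python
import math
import statistics
from collections import Counter

def dc_features(z_d, neighbors):
    """Compute the five DC(d) features from a non-empty list of
    (vector, label) pairs representing V_d."""
    counts = Counter(label for _, label in neighbors)
    majority_label, _ = counts.most_common(1)[0]
    total = len(neighbors)
    # Occurrence frequency of each label that appears in V_d (assumption:
    # absent labels are not counted as zero-frequency entries).
    freqs = [c / total for c in counts.values()]
    freq_std = statistics.pstdev(freqs)
    # Distances between z_d and the vectors carrying the majority label.
    dists = [math.dist(z_d, v) for v, label in neighbors
             if label == majority_label]
    return (majority_label, freq_std, statistics.mean(dists),
            statistics.median(dists), statistics.pstdev(dists))

# Toy neighborhood: two vectors labeled as malware, one as spam.
neighbors = [((3.0, 4.0), "Malware Domains"),
             ((6.0, 8.0), "Malware Domains"),
             ((0.0, 5.0), "Spam Domains")]
features = dc_features((0.0, 0.0), neighbors)
```

The majority label is categorical, so a real feature vector would need some numeric encoding of it; that encoding is not described here and is left out of the sketch.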


3.3.6 Assigning Reputation Scores 


Given a domain $d$, once we compute the vectors $NM(d)$ and
$DC(d)$ as explained above, we also compute the evidence
vector $EV(d)$ as explained in Section 3.2.3. At this point, we
concatenate these three feature vectors into a sixteen-dimensional
feature vector $v(d)$, and we feed $v(d)$ as input to our
trained reputation function (see Section 3.3.4). The reputation
function computes a score $S = 1 - f(d)$, where $f(d)$ can
be interpreted as the probability that $d$ is a malicious domain
name. $S$ varies in the $[0, 1]$ interval, and the lower the value of
$S$, the lower $d$’s reputation.
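
The final scoring step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $NM(d)$ and $DC(d)$ have five elements each as described above, so $EV(d)$ is assumed to contribute the remaining six; `toy_f` is a hypothetical stand-in for the trained reputation function, not the classifier used by Notos.

```python
def reputation_score(nm, dc, ev, f):
    """Concatenate NM(d), DC(d) and EV(d) into v(d) and return
    S = 1 - f(v(d)), where f returns the probability of maliciousness."""
    v = list(nm) + list(dc) + list(ev)
    if len(v) != 16:
        raise ValueError("expected a sixteen-dimensional feature vector")
    return 1.0 - f(v)

def toy_f(v):
    """Hypothetical reputation function: constant 25% probability
    that the domain is malicious, regardless of the input."""
    return 0.25

S = reputation_score([0.0] * 5, [0.0] * 5, [0.0] * 6, toy_f)
```

With this stand-in, the domain receives a score of 0.75, i.e., a fairly high reputation.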


4 Data Collection and Analysis 


This section summarizes observations from passive DNS 
measurements, and how professional, legitimate DNS services 
are distinguished from malicious services. These observations 
provided the ground truth for our dynamic domain name rep- 
utation system. We also provide an intuitive example to illus- 
trate these properties, using a few major Internet zones like 
Akamai and Google. 


4.1 Data Collection 


The basic building block for our dynamic reputation rating 
system is the historical or “passive” information from success- 
ful A-type DNS resolutions. We use the DNS traffic from 
two ISP-based sensors, one located on the US east coast (At- 
lanta) and one located on the US west coast (San Jose). Addi- 
tionally we use the aggregated DNS traffic from the different 
networks covered by the SIE [3]. In total, our database collected
27,377,461 unique resolutions from all these sources
over a period of 68 days, from the 19th of July 2009 to the 24th of
September 2009.

Simple measurements performed on this large data set
demonstrate a few important properties leveraged by our selected
features. After just a few days the rate of new, unique
pDNS entries leveled off. The graph in Figure 7(b) shows
only about 100,000 to 150,000 new domains/day (with a brief
outage issue on the 53rd day), despite very large numbers of
RRs arriving each day (shown in Figure 7(a)). This suggests
that most RRs are duplicates: after the first
few days, on average 94.7% of the unique RRs observed
daily at the sensor level are already recorded by
the passive DNS database. Therefore, even a relatively small
pDNS database may be used to deploy Notos. In Section 5, we
measure the sensitivity of our system to traffic collected from
smaller networks.

Figure 7. Various RRs growth trends observed in the pDNS DB over a period of 68 days
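
The kind of measurement described above can be sketched as follows. The code is a simplified illustration, not the measurement pipeline used in the study: RRs are modeled as hypothetical (domain, IP) pairs, and the function name and sample data are assumptions.

```python
def daily_growth(days):
    """For each day's list of (domain, ip) RRs, return a tuple
    (total RRs seen, new unique RRs, fraction of that day's unique RRs
    already present in the passive DNS database)."""
    seen = set()          # stands in for the passive DNS database
    report = []
    for rrs in days:
        unique_today = set(rrs)
        new = unique_today - seen
        known = 1 - len(new) / len(unique_today) if unique_today else 0.0
        report.append((len(rrs), len(new), known))
        seen |= unique_today
    return report

# Two toy days of traffic using documentation-only names and addresses.
day1 = [("a.example.com", "192.0.2.1"), ("a.example.com", "192.0.2.1"),
        ("b.example.com", "192.0.2.2")]
day2 = [("a.example.com", "192.0.2.1"), ("b.example.com", "192.0.2.2"),
        ("c.example.com", "192.0.2.3"), ("c.example.com", "192.0.2.3")]
report = daily_growth([day1, day2])
```

On the second toy day, two of the three unique RRs are already known, which is the same kind of ratio as the 94.7% figure reported for the real data.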

The remaining plots in Figure 7 show the daily growth of 
our passive DNS database, from the point of view of five dif- 
ferent zone classes. Figure 7(c) and (d) show the growth rate 
associated with CDN networks (Akamai, and all other CDNs). 
The number of unique IPs grows almost in step with the number
of unique domains (meaning that each new RR pairs a new
IP with a new child domain of the CDN). In a few weeks, most
of the IPs became known, suggesting that one can fully map
CDNs with a modest training set. This is because CDNs,
although large, always have a fixed number of IP addresses used
for hosting their high-availability services. Intuitively, we be- 
lieve this would not be the case with malicious CDNs (e.g., 
flux networks), which use randomly spreading infections to 
continually recruit new IPs. 

The ratio of new IPs to domains diverges in Figure 7(e), 
a plot of the rate of newly discovered RRs for popular web- 
sites (e.g., Google, Facebook). Facebook notably uses unique 
child domains for their Web-based chat client, and other top 
Internet sites use similar strategies (encoding information in 
the domain, instead of the URI), which explains the growth
in domains shown in Figure 7(e). These popular sites use a 
very small number of IPs, however, and after a few weeks of 
training our pDNS database identified all of them. Since these 
popular domains make up a large portion of traffic in any trace, 
our intuition is that simple whitelisting would significantly re- 
duce the workload of a classifier. 

Figure 7(f) shows the rate of pDNS growth for zones in 
Dynamic DNS providers. These services, sometimes used by 
botmasters, demonstrate a nearly matched ratio of new IPs to 
new domains. The data excludes non-routable answers (e.g.,
dynamic DNS domains pointing to 127.0.0.1), since these contain
no unique network information. Intuitively, one can think
of dynamic DNS as a nearly complete bijection of domains to
IPs. Figure 7(g) shows the growth of RRs for the alexa.com
top 100 domains. Unlike dynamic DNS domains, these point
to a small set of unique addresses, and most can be identified
in a few weeks’ worth of training.

A comparison of all the zone classes appears in Figure 7(h), 
which shows the cumulative distribution of the unique RRs de- 
tailed in Figure 7(c) through (g). The different rates of change 
illustrate how each zone class has a distinct pattern of RR use: 
some have a small IP space and highly variable domain names; 
some pair nearly every new domain with a new IP. Learning 
approximately 90% of all the unique RRs in each zone class, 
however, only requires (at most) tens of thousands of distinct 
RRs. The intuition from this plot is that, despite the very large 
data set we used in our study, Notos could potentially work
with data observed from much smaller networks. 


4.2 Building The Ground Truth 


To establish ground truth, we used two different labeling
processes. First, we assigned labels to RRs at the time of their 
discovery. This provided an initial static label for many do- 
mains. Blacklists, of course, are never complete and always 
dynamic. So our second labeling process took place during 
evaluation, and monitored several well-known domain black- 
lists and whitelists. 

The data we used for labeling came from several sources. 
Our primary blacklist sources were services
such as malwaredomainlist.com and malwaredomains.com.
In order to label IP addresses in our pDNS
database we also used the Spamhaus Block List (SBL) from
Spamhaus [18]. Such IPs are either known to send spam or
distribute malware. We also collected domain name and IP 
blacklisting information from the Zeus tracker [30]. All this 
blacklisting information was gathered before the first day of 
August 2009 (during all the 15 days in which we collected 
passive DNS data). Since blacklists traditionally lag behind 
the active threat, we continued to collect all new data until the 
end of our experiments. 

Our limited whitelisting was derived from the top 500
alexa.com domain names, as of the 1st of August 2009. We
reasoned that, although some malicious domains become popular,
they do not stay popular (because of remediation), and
never break into the top tier of domain rankings. Likewise, we
used a list of the 18 most common 2LDs from various CDNs,
which composed the main corpus of our CDN labeled RRs.
Finally, a list of 464 dynamic DNS second-level domains allowed
us to identify and label domain names and IPs coming
from zones under dynamic DNS providers. We label our evaluation
(or testing) dataset by aggregating updated blacklist
information for new malicious domain names and IPs from
the same lists.

To compute the honeypot features (presented in Sec- 
tion 3.2.3) we need a malware analysis infrastructure that can 
process as many “new” malware samples as possible. Our 
honeypot infrastructure is similar to “Ether” [4] and is capa- 
ble of processing malware samples in a queue. Every malware 
sample was analyzed in a controlled environment for a time 
period of five minutes. This process was repeated during the 
last 15 days of July 2009. After 15 days of executions we
obtained a set of successful DNS resolutions (domain names
and IPs) that each malware sample looked up. We chose to execute
malware and collect DNS evidence during the same period
of time in which we aggregated the passive DNS database. Our
virtual machines are equipped with five popular commercial
anti-virus engines. If one of the engines identifies an executable
as malicious, we capture all domain names and the
corresponding IP mappings that the malware used during execution.
After excluding all domain names that belong to the
top 500 most popular alexa.com zones, we assemble the
main corpus of our “honeypot data”. We automated the crawling
and collection of blacklist information and honeypot execution.

The reader should note that we chose to label our data in
as transparent a way as possible. We used public blacklisting
information to label our training dataset before we built our
models and trained the reputation function. Then we assigned
the reputation scores and validated the results again using the
same publicly available blacklist sources. It is safe to assume
that private IP and DNS blacklists will contain significantly
more complete information, with lower FP rates, than the public
blacklists. Using such private blacklists, the accuracy
of Notos’ reputation function should improve significantly.


5 Results 


In this section, we present the experimental results of our 
evaluation. We show that Notos can identify malicious domain 
names sooner than public blacklists, with a low false posi- 
tive rate (FP%) of 0.38% and high true positive rate (TP%) 
of 96.8%. As a first step, we computed vectors based on 
the statistical features (described in Section 3.2) from 250,000 
unique RRs. This volume corresponds to the average volume 
of new — previously unseen — RRs observed at two recursive 
DNS servers in a major ISP in one day, as noted in Section 4, 
Figure 7(b). These vectors were computed based on historic 
passive DNS information from the last two weeks of DNS traf- 
fic observed on the same two ISP recursive resolvers in Atlanta 
and San Jose. 


5.1 Accuracy of Network Profile Modeling 


The accuracy of the Meta-Classification system (Fig- 
ure 4(a)) in the network profile module is critical for the over- 
all performance of Notos. This is because, in the on-line mode, 
Notos will receive unlabeled vectors which must be classified 
and correlated with what is already present in our knowledge 
base. For example, if the classifier receives a new RR and as- 
signs to it the label Akamai with very high confidence, that 
implies the RR which produced this vector will be part of a 
network similar to Akamai. However, this does not necessar- 
ily mean that it is part of the actual Akamai CDN. We will see 
in the next section how we can draw conclusions based on the 
proximity between labeled and unlabeled RRs within the same 
zone-based clusters. Furthermore, we discuss the accuracy 
of the Meta-Classifier when modeling each different network 
profile class (profile classes are described in Section 3.3.2). 

Our Meta-Classifier consists of five different classifiers,
one for each different class of domains we model. We chose to
use a Meta-Classification system instead of a traditional single
classification approach because Meta-Classification systems
typically perform better than a single statistical classifier
[11, 2]. Throughout our experiments this also proved to be
true. The ROC curves in Figure 8 show that the Meta-Classifier
can accurately classify RRs for all different network
profile classes.

Figure 8. ROC curves for all network profile
classes show the Meta-Classifier’s accuracy.


The training dataset for the Meta-Classifier is composed 
of sets of 2,000 vectors from each of the five network profile 
classes. The evaluation dataset is composed of 10,000 vectors, 
2,000 from each of the five network profile classes. The classi- 
fication results for the domains in the Akamai, CDN, dynamic 
DNS and Popular classes showed that the supervised learn- 
ing process in Notos is accurate, with the exception of a small 
number of false positives related to the Common class (3.8%). 
After manually analyzing these false positives, we concluded 
that some level of confusion between the vectors produced by 
Dynamic DNS domain names and the vectors produced by 
domain names in the Common class still remains. However, 
this minor misclassification between network profiles does not 
significantly affect the reputation function. This is because 
the zone profiles of the Common and Dynamic DNS domain 
names are significantly different. This difference in the zone 
profiles will drive the network-based and zone-based cluster- 
ing steps to group the RRs from Dynamic DNS class and Com- 
mon class in different zone-based clusters. 


Although the network profile modeling process
provides accurate results, this step alone cannot
designate a domain as benign or malicious. The
clustering steps help Notos group vectors not only
based on their network profiles but also based on their zone
properties. In the following section we show how the network and
zone profile clustering modules can better associate similar
vectors, due to properties of their domain name structure.

Figure 9. The ROC curve from the reputation function
indicating the high accuracy of Notos.


5.2 Network and Zone-Based Clustering Results 


In the domain name clustering process (Section 3.3.3, Figure 4(b))
we used X-Means clustering in series, once for the
network-based clustering and again for the zone-based clustering.
In both steps we set the minimum and maximum number
of clusters to one and the total number of vectors in our
dataset, respectively. We ran these two steps using different
numbers of zone and network vectors. Figure 11 shows that
after the first 100,000 vectors are used, the number of network
and zone clusters remains fairly stable. This means that by
computing at least 100,000 network and zone vectors (using
a 15-day-old passive DNS database) we can obtain a stable
population of zone and network based clusters for the monitored
network. We should note that reaching this network and
cluster equilibrium does not imply that we do not expect to
see any new type of domain names at the ISP’s recursive DNS
servers. It only denotes that, based on the RRs present in our
passive DNS database and the daily traffic at the ISP’s recursive
servers, 100,000 vectors are enough to reflect the major network
profile trends in the monitored networks. Figure 11 indicates
that a sample set of 100,000 vectors may represent the major
trends in a DNS sensor. It is hard to safely estimate the exact
minimum number of unique RRs that is sufficient to identify
all major DNS trends. An answer to this should be based upon
the type, size and utilization of the monitored network. Without
data from smaller corporate networks it is difficult for us
to make a safe assessment about the minimum number of RRs
necessary for reliably training Notos.

The evaluation dataset we used consisted of 250,000 unique
domain names and IPs. The cluster overview is shown in Figure 10,
and in the following paragraphs we discuss some interesting
observations that can be made from these network-based
and zone-based cluster assignments. As an example,
network clusters 0 and 1 are predominantly composed of zones
participating in fraudulent activities like spam campaigns (yellow)
and malware dropping or C&C zones (red). On the other
hand, network clusters 2 to 5 contain Akamai, dynamic DNS,
and popular zones like Google, all labeled as benign (green).
We included the unlabeled vectors (blue) based on which we
evaluated the accuracy of our reputation function. We have a
sample of unlabeled vectors in almost all network and zone
clusters. We will see how already labeled vectors will assist
us to characterize the unlabeled vectors in close proximity.

Figure 10. With the two-step clustering process, Notos
is able to cluster large trends of DNS behavior.


Before we describe two sample cases of dynamic characterization
within zone-based clusters, we need to discuss our
radius $R$ and $k$ value selection (see Section 3.3.5). In Section 3.3.5,
we discussed how we build domain name clusters.
At that point we introduced the dynamic characterization process
that gives Notos the ability to utilize already labeled vectors
in order to characterize a newly obtained unlabeled vector by
leveraging our prior knowledge. After looking into the distribution
of Euclidean distances between unlabeled and labeled
vectors within the same zone clusters, we concluded that in the
majority of these cases the distances were between 0 and 1000.
We tested different values of the radius $R$ and the value of $k$
for the K-nearest neighbors (KNN) algorithm. We observed
that the experiments with radius values between 50 and 200
provided the most accurate reputation rating results, which we
describe in the following sections. We also observed that if
$k > 25$ the accuracy of the reputation function is not affected
for all radius values between 50 and 200. Based on the results
of these pilot experiments, we decided to set $k$ equal to 50 and
the radius distance equal to 100.

Figure 11. By using different numbers of network
and zone vectors we observe that after the first
100,000, there is no significant variation in the absolute
number of produced clusters during the 1st
and 2nd level clustering steps.

Figures 12 and 13 show the effect of this radius selection 
on two different types of clustering problems. In Figure 12, 
unknown RRs for akamaitech.net are clustered with a 
labeled vector akamai.net. As noted in Section 4, CDNs 
such as Akamai tended to have new domain names with each 
RR, but to also reuse their IPs. By training with only a small 
set of labeled akamai.net RRs, our classifier put the new, 
unknown RRs for akamaitech.net into the existing Akamai
class. IP-specific features therefore brought the new RRs
close to the existing labeled class. Figure 12 compresses all 
of the dimensions into a two-dimensional plot (for easier vi- 
sual representation), but it is clear the unknown RRs were all 
within a distance of 100 to the labeled set. 

This result validates the design used in Section 4, where 
just a few weeks’ worth of labeled data was necessary for 
training. Thus, one does not have to exhaustively discover all 
whitelisted domains. Notos is resilient to changes in the zone 
classes we selected. Services like CDNs and major web sites 
can add new IPs or adjust domain formats, and these will be 
automatically associated with a known labeled class. 

The ability of Notos to associate new RRs based on limited
labeled inputs is demonstrated again in Figure 13. In
this case, labeled Zeus domains (approximately 2,900 RRs
from three different Zeus-related BLs) were used to classify
new RRs. Figure 13 plots the distance between the labeled
Zeus-related RRs and new (previously unknown) RRs
that are also related to Zeus botnets. As we can see from
Figure 13, most of the new (unlabeled) Zeus RRs lie very
close to, and often even overlap with, known Zeus RRs. This
is a good result, because Zeus botnets are notoriously hard
to track, given the botnet’s extreme agility. Tracking systems
such as zeustracker.abuse.ch and malwaredomainlist.com
have limited visibility into the botnet,
and often produce disjoint blacklists. Notos addresses this
problem by leveraging a limited amount of training data to
correctly classify new RRs. In our evaluation set, Notos
correctly detected 685 new (previously unknown) Zeus RRs.

Figure 12. An example of characterizing the akamaitech.net
unknown vectors as benign based on
the already labeled vectors (akamai.net) present
in the same cluster.


5.3 Accuracy of the Reputation Function


The first thing that we address in this section is our deci- 
sion to use a Decision Tree using Logit-Boost strategy (LAD) 
as the reputation function. Our decision is motivated by the 
time complexity, the detection results and the precision (true 
positives over all positives) of the classifier. We compared 
the LAD classifier to several other statistical classifiers using 
a typical model selection procedure [6]. LAD was found to 
provide the most accurate results in the shortest training time 
for building the reputation function. As we can see from the 
ROC curve in Figure 9, the LAD classifier exhibits a low false 
positive rate (FP%) of 0.38% and true positive rate (TP%) of 
96.8%. It is worth noting that these results were obtained using
10-fold cross-validation, and the detection threshold was set
to 0.5. The dataset used for the evaluation contained 10,719
RRs related to 9,530 known bad domains. The list of known 
good domains consisted of the top 500 most popular domains 
according to Alexa. 
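
To make the two reported rates concrete, the following Python sketch shows how a true positive rate and a false positive rate can be derived from reputation scores at a given detection threshold. It is an illustration only: the function name and the toy scores are hypothetical, the cross-validation procedure is not reproduced, and a domain is assumed to be flagged when its score is strictly lower than the threshold.

```python
def detection_rates(scores, is_malicious, threshold=0.5):
    """Flag a domain as malicious when its reputation score is below
    `threshold`; return (true positive rate, false positive rate).
    Assumes both malicious and benign examples are present."""
    tp = fp = positives = negatives = 0
    for score, bad in zip(scores, is_malicious):
        flagged = score < threshold
        if bad:
            positives += 1
            tp += int(flagged)
        else:
            negatives += 1
            fp += int(flagged)
    return tp / positives, fp / negatives

# Toy data: three known bad domains followed by three known good ones.
scores = [0.1, 0.2, 0.7, 0.9, 0.95, 0.3]
labels = [True, True, True, False, False, False]
tp_rate, fp_rate = detection_rates(scores, labels)
```

In this toy case two of the three bad domains and one of the three good domains fall below the threshold, giving a TP rate of 2/3 and an FP rate of 1/3.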

We also benchmarked the reputation function on two other
datasets containing a larger number of known good domain
names. We experimented with both the top 10,000 and top
100,000 Alexa domain names. The detection results for these
experiments are as follows. When using the top 10,000 Alexa
domains, we obtained a true positive rate of 93.6% and a false
positive rate of 0.4% (again using 10-fold cross-validation and
a detection threshold equal to 0.5). As we can see, these results
are not very different from the ones we obtained using only
the top 500 Alexa domains. However, when we extended our
list of known good domains to include the top 100,000 Alexa
domain names, we observed a significant decrease of the true
positive rate and an increase in the false positives. Specifically,
we obtained a TP% of 80.6% and a FP% of 0.6%. We believe
this degradation in accuracy may be due to the fact that the
top 100,000 Alexa domains include not only professionally
run domains and network infrastructures, but also less
reputable domain names, such as file-sharing and porn-related
websites, most of which are not run in a professional way and
have disputable reputation¹.

¹ A quick analysis of the top 100,000 Alexa domains reported that about
5% of the domains appeared in the SURBL (www.surbl.org) blacklist, at
a certain point in time. A more rigorous evaluation of these results is left to
future work.

Figure 13. An example of how the Zeus botnet
clusters during our experiments. All vectors are
in the same network cluster and in two different
zone clusters.

We also wanted to evaluate how well Notos performs compared
to static blacklists. To this end, we performed a number
of experiments as follows. Given an instance of Notos trained
with data collected up to July 31, 2009, we fed Notos with
250,000 distinct RRs found in DNS traffic we collected on
August 1, 2009. We then computed the reputation score for
each of these RRs. First, we set the detection threshold to 0.5,
and with this threshold we identified 54,790 RRs that had a
low reputation (lower than the threshold). These RRs were
related to a total of 10,294 distinct domain names (notice that
a domain name may map to more than one IP, and this explains
the higher number of RRs). Of these 10,294 domains,
7,984 (77.6%) appeared in at least one of the public blacklists
we used for comparison (see Section 4) within 60 days
after August 1, and were therefore confirmed to be malicious.
Figure 14(a) reports the number of RRs classified
as having low reputation by Notos that appeared in the public
blacklists, and the dates on which they appeared. The remaining
three plots (Figure 14(b), (c) and
(d)) report the same results organized according to the type of
malicious domains. In particular, it is worth noting that Notos
is able to detect never-before-seen domain names related to the
Zeus botnet several days or even weeks before they appeared
in any of the public blacklists.

Figure 14. Dates on which various blacklists confirmed
that the RRs were malicious after Notos
assigned low reputation to them on the 1st of
August.

For the remaining 22.4% of the 10,294 domains we considered,
we were not able to draw a definitive conclusion. However,
we believe many of those domains are involved in some
kind of more or less malicious activity. We also noticed
that 7,980 of the 7,984 confirmed bad domain names were
assigned a reputation score lower than or equal to 0.15, and that
none of the other non-confirmed suspicious domains received
a score lower than this threshold. In practice, this means that
an operator who would like to use Notos as a stand-alone dynamic
blacklisting system while limiting the false positives to
a negligible (or even zero) amount may fine-tune the detection
threshold and set it around 0.15.
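
The threshold trade-off can be sketched as follows. The first two lines only re-derive percentages from the counts reported above; the function and the toy scores are hypothetical and illustrate how an operator might inspect a candidate threshold, assuming a domain is blacklisted when its score is lower than or equal to that threshold.

```python
# Percentages recomputed from the counts reported in the text.
confirmed_share = round(100 * 7984 / 10294, 1)      # share confirmed by BLs
below_015_share = round(100 * 7980 / 7984, 2)       # confirmed with S <= 0.15

def blacklist_coverage(confirmed, unconfirmed, threshold):
    """Return (fraction of confirmed-bad scores at or below `threshold`,
    number of unconfirmed domains that would also be blacklisted)."""
    caught = sum(1 for s in confirmed if s <= threshold)
    extra = sum(1 for s in unconfirmed if s <= threshold)
    return caught / len(confirmed), extra

# Toy reputation scores (hypothetical).
confirmed = [0.05, 0.10, 0.15, 0.40]
unconfirmed = [0.20, 0.45]
strict = blacklist_coverage(confirmed, unconfirmed, 0.15)
loose = blacklist_coverage(confirmed, unconfirmed, 0.50)
```

In the toy data, the stricter threshold catches fewer confirmed domains but blacklists no unconfirmed ones, which mirrors the trade-off described for the 0.15 setting.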


5.4 Discussion


This section discusses the limits of Notos, and the potential
for evasion in real networks. One of the main limitations
is the fact that Notos is unable to assign reputation scores for
domain names with very little historic (passive DNS) information.
Sufficient time and a relatively large passive DNS collection
are required to create an accurate passive DNS database.
Therefore, if an attacker always buys new domain names and
new address space, and never reuses either resource for any
other malicious purposes, Notos will not be able to accurately
assign a reputation score to the new domains. In the IPv4
space, this is very unlikely to happen due to the impending
exhaustion of the available address space. Once IPv6 becomes
the predominant protocol, however, this may represent a problem
for the statistical features we extract based on IP granularity.
However, we believe the features based on BGP prefixes
and AS numbers would still be able to capture the agility typical
of malicious DNS hosting behavior.


As long as newly generated domain names share some net- 
work properties (e.g., IPs or BGP prefixes) with already la- 
beled RRs, Notos will be able to assign an accurate reputa- 
tion score. In particular, since network resources are finite and 
more expensive to renew or change, even if the domain prop- 
erties change, Notos can still identify whether a domain name 
may be associated with malicious behavior. In addition, if a 
given domain name for which we want to know the reputation 
is not present in the passive DNS DB, we can actively probe it, 
thus forcing a related passive DNS entry. However, this is pos- 
sible only when the domain successfully maps to a non-empty 
set of IPs. 


Our experimental results using the top 10,000 Alexa domain
names as known good domains report a false positive
rate of 0.4%. While low in percentage, the absolute number of
false positives may become significant in those cases in which
very large numbers of new domain names are fed to Notos on
a daily basis (e.g., in case of deployment in a large ISP network).
However, we envision our Notos reputation system
being used not as a stand-alone system, but rather in cooperation
with other defense mechanisms. For example, Notos may be
used in collaboration with a spam-filtering system. If an email
contains a link to a website whose domain name has a low reputation
score according to Notos, the spam filter can increase
the total spam score of the email. However, if the rest of the
email appears to be benign, the spam filter may still decide to
accept the email.
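
One possible shape of such a cooperation is sketched below in Python. Everything in it is hypothetical: the function names, the additive penalty, and the numeric thresholds are assumptions chosen for illustration and do not describe any existing spam filter or a tested Notos deployment.

```python
def adjusted_spam_score(base_score, link_reputations,
                        low_rep_threshold=0.5, penalty=2.0):
    """Add a fixed penalty to the filter's own spam score for every
    linked domain whose reputation is below `low_rep_threshold`."""
    low = sum(1 for s in link_reputations if s < low_rep_threshold)
    return base_score + penalty * low

def accept(base_score, link_reputations, reject_at=5.0):
    """Accept the email unless the adjusted score reaches `reject_at`."""
    return adjusted_spam_score(base_score, link_reputations) < reject_at

# An otherwise benign email with one low-reputation link is still accepted,
# while an already suspicious email with the same link is rejected.
benign_with_bad_link = accept(1.0, [0.1, 0.9])
suspicious_with_bad_link = accept(4.0, [0.1])
```

Treating the reputation score as one additive signal, rather than as a hard verdict, is what allows the filter to tolerate occasional false positives from the reputation system.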


During our manual analysis of (a subset of) the false positives
encountered in our evaluations we were able to make
some interesting observations. We found that a number of legitimate
sites (e.g., goldsgym.com) are hosted in networks
that also host large volumes of malicious domain names.
In these cases Notos will tend to penalize the reputation
of these legitimate domains because they reside in a bad neighborhood.
In time, the reputation score assigned to these domains
may change, if the administrators of the network
in which the benign domain names are hosted take action to
“clean up” their networks and stop hosting bad domain names
within their address space.
Domain name                   IP address
5                             213.182.197.229
analf.net                     222.186.31.169
pro-buh.ru                    89.108.67.83
ammdamm.cn                    92.241.162.55
briannazfunz.com              95.205.116.55
mybank-of.com                 59.125.229.73
oc00co.com                    212.117.165.128
avangadershem.com             195.88.190.29
securebizccenter.cn           122.70.145.140
adobe-updating-service.cn     59.125.231.252
Omd.ru                        219.152.120.118
avrev.info                    98.126.15.186
g0Oglee.cn                    218.93.202.100

Table 1. Sample cases from Zeus domains detected
by Notos and the corresponding days
on which they appeared in the public BLs. All evidence
information in this table was harvested from
zeustracker.abuse.ch.


6 Conclusion 


In this paper, we presented Notos, a dynamic reputation 
system for DNS. To the best of our knowledge, Notos is the 
first system that can assign a dynamic reputation score to any 
domain name in a DNS query that traverses the edge of a 
monitored network. Notos harvests information from multiple 
sources such as the DNS zone a domain name belongs to, the
related IP addresses, BGP prefixes, AS information and hon- 
eypot analysis to maintain up-to-date DNS information about 
legitimate and malicious domain names. Based on this infor- 
mation, Notos uses automated classification and clustering al- 
gorithms to model network and zone behaviors of legitimate 
and malicious domains, and then applies these models to com- 
pute a reputation score for a (new) domain name. 

Our evaluation using real-world data, which includes traf- 
fic from large ISP networks, demonstrates that Notos is highly 
accurate in identifying new malicious domains in the moni- 
tored DNS query traffic, with a true positive rate of 96.8% and 
false positive rate of 0.38%. In addition, Notos is capable of 
identifying these malicious domains weeks or even months before
they appear in public blacklists, thus enabling proactive
security countermeasures against cyber attacks. 
94.23.198.97 -
213.251.176.169
antivirprotect.com         64.40.103.249
lspeed.info                212.117.163.165
spy-destroyer.com          67.211.161.44
free-spybot.com            63.243.188.110
a3l.at                     89.171.115.10
gidromash.cn               211.95.79.170
lantivirus-pro.com         188.40.52.180
ericwanhouse.cn            220.196.59.19
1165651291.com             212.117.165.126

Table 2. Anecdotal cases of malicious domain
names detected by Notos and the correspond-
ing days on which they appeared in the public BLs. [1]:
hosts-file.net, [2]: malwareurl.com, [3]: siteadvisor.com, [4]:
virustotal.com, [5]: ddanchev.blogspot.com, [6]: malwaredo-
mainlist.com
Abstract 


On November 3, 2009, voters in Takoma Park, Mary- 
land, cast ballots for the mayor and city council members 
using the Scantegrity II voting system—the first time 
any end-to-end (E2E) voting system with ballot privacy 
has been used in a binding governmental election. This 
case study describes the various efforts that went into 
the election—including the improved design and imple- 
mentation of the voting system, streamlined procedures, 
agreements with the city, and assessments of the experi- 
ences of voters and poll workers. 

The election, with 1728 voters from six wards, in- 
volved paper ballots with invisible-ink confirmation 
codes, instant-runoff voting with write-ins, early and 
absentee (mail-in) voting, dual-language ballots, provi- 
sional ballots, privacy sleeves, any-which-way scanning 
with parallel conventional desktop scanners, end-to-end 
verifiability based on optional web-based voter verifica- 
tion of votes cast, a full hand recount, thresholded author- 
ities, three independent outside auditors, fully-disclosed 
software, and exit surveys for voters and pollworkers. 

Despite some glitches, the use of Scantegrity II was 
a success, demonstrating that E2E cryptographic voting 
systems can be effectively used and accepted by the gen- 
eral public. 


1 Introduction 


The November 2009 municipal election of the city of 
Takoma Park, Maryland marked the first time that any- 
one could verify that the votes were counted correctly in 
a secret ballot election for public office without having 
to be present for the entire proceedings. This article is a 
case study of the Takoma Park election, describing what 
was done—from the time the Scantegrity Voting Sys- 
tem Team (SVST) was approached by the Takoma Park 
Board of Elections in February 2008, to the last crypto- 
graphic election audit in December 2009—and what was 
learned. While the paper provides a simple summary of 
survey results, the focus of this paper is not usability but 
the engineering process of bringing a new cryptographic 
approach to solve a complex practical problem involving 
technology, procedures, and laws. 

With the Scantegrity II voting system, voters mark op- 
tical scan paper ballots with pens, filling the oval for 
the candidates of their choice. These ballots are handled 
as traditional ballots, permitting all the usual automated 
and manual counting, accounting, and recounting. Ad- 
ditionally, the voting system provides a layer of integrity 
protection through its use of invisible-ink confirmation 
codes. When voters mark ballot ovals using a decoder 
pen, confirmation codes printed in invisible ink are re- 
vealed. Interested voters can note down these codes to 
check them later on the election website. The codes are 
generated randomly for each race and each ballot, and 
hence do not reveal the corresponding vote. A final tally 
can be computed from the codes and the system provides 
a public digital audit trail of the computation. 

Election audits in Scantegrity II are not restricted to 
privileged individuals and can be performed by voters 
and other interested parties. Developers and election au- 
thorities are unable to significantly falsify an election 
outcome without an overwhelming probability of an au- 
dit failure [8]. The other side of the issue of integrity, 
also solved by the system, is that false claims of impro- 
priety in the recording and tally of the votes are readily 
revealed to be false.¹

All the software used in the election—for ballot au- 
thoring, printing, scanning and tally—was published 
well in advance of the election as commented, buildable 
source code, which may be a first in its own right. More- 
over, commercial off-the-shelf scanners were adapted to 
receive ballots in privacy sleeves from voters, making the 


¹Note that a threat present and not commonly addressed in paper 
ballot systems is that additional marks could be added to ballots by 
those with special access. Such attacks are made more difficult by 
Scantegrity II. 
overall system relatively inexpensive. 

Despite several limitations of the implementation, we 
found that the amount of extra work needed by officials 
to use Scantegrity II while administering an election is 
acceptable given the promise of improved voter satisfac- 
tion and indisputability of the outcome. Indeed, discus- 
sions are ongoing with the Board of Elections of the city 
regarding continued use of the system in future elections. 

Another observation from the election is that the elec- 
tion officials and voters surveyed seemed to appreciate 
the system. Since voters who do not wish to verify can 
simply proceed as usual, ignoring the codes revealed in 
the filled ovals, the system is least intrusive for these vot- 
ers. Those voters who did check their codes, and even 
many who did not, seem to appreciate the opportunity. 

This paper describes the entire process of adapting the 
Scantegrity II system to handle the Takoma Park elec- 
tion, including the agreement with the city, printing the 
special ballots with invisible-ink confirmation codes, ac- 
tually running the election, and verifying that the election 
outcome was correct. 


Organization of this case study The next section pro- 
vides an overview of related work in this area, summa- 
rizing previous experiments with Scantegrity I and other 
E2E systems in practical settings. 

Section 3 describes in more detail the setting for the 
election: giving details about Takoma Park and their 
election requirements. Section 4 gives more details of 
the Scantegrity II voting system, including a description 
of how one can “audit” an election. Section 5 provides 
an overview of the implementation of the voting system 
for the November 3, 2009 Takoma Park municipal elec- 
tion, including the scanner software, the cryptographic 
back-end, and the random-number generation routines. 

Section 6 gives a chronological presentation and time- 
line of the steps taken to run the November election, 
including the outcome of the voter verification and the 
audits. It also gives the results of the election, with 
some performance and integrity metrics. Section 7 re- 
ports some results of the exit surveys taken of voters and 
pollworkers. 

Section 8 discusses the high-level lessons learned from 
this election. Section 9 provides some conclusions, ac- 
knowledgements, and disclosures required by the pro- 
gram committee. 


2 Related Work 


Chaum was the first to propose the use of cryptogra- 
phy for the purpose of secure elections [5]. This was 
followed by almost two decades of work in improving 
security and privacy guarantees (for a nice survey, see 
Adida [1]), most recently under the rubric of end-to-end 
voting systems. These voting system proposals provide 
integrity (any attempt to change the tally can be caught 
with very high probability by audits which are not re- 
stricted to privileged individuals) and ballot secrecy. 

The first of these proposals include protocols by 
Chaum [6] and Neff [19], which were implemented soon 
after (Chaum’s as Citizen-Verified Voting [16] and Neff’s 
by VoteHere). Several more proposals with prototypes 
followed: Prêt à Voter [10], Punchscan [21, 15], the pro- 
posal of Kutylowski and Zagorski [18] as Voting Ducks, 
and Simple Verifiable Voting [4] as Helios [2] and Vote- 
Box [24]. 

Making end-to-end systems usable in real elections 
has proven to be challenging. We are aware of the follow- 
ing previous binding elections held using similar verifi- 
cation technology: the Punchscan elections for the grad- 
uate students’ union of the University of Ottawa (2007) 
and the Computer Professionals for Social Responsibil- 
ity (2007); the Rijnland Internet Election System (RIES) 
public elections in the Netherlands in 2004 and 2006; the 
Helios elections of the Recteur of Université Catholique 
de Louvain [3] (2009) and the Princeton undergraduate 
student government election (2009), as well as a student 
election using Prêt à Voter. 

Only the RIES system has been used in a governmen- 
tal election; however, it is meant for remote (absentee) 
voting and, consequently, does not offer strong ballot se- 
crecy guarantees. For this reason, it has been recom- 
mended that the RIES system not be used for regular 
public elections [17, 20]. Helios is also a remote vot- 
ing system, and offers stronger ballot secrecy guarantees 
than RIES. The Punchscan elections were the closest to 
this study, but they did not rise to the level of public 
elections. They did not have multiple ballot styles, the 
users of the system were not a broad cross-segment of 
the population as in Takoma Park, the system implemen- 
tors were deeply involved in administering the elections, 
and no active auditors were established to audit the elec- 
tions. To date, this study is the most comparable use case 
of E2E technology to that of a typical optical scan elec- 
tion. 

The case study reported here is based on a series of 
systems successively developed, tested, and deployed by 
a team of researchers included among the present au- 
thors originating with the Punchscan system. Although 
it used paper ballots, the Punchscan system did not al- 
low manual recounts, a feature that the team recognized 
as needing to be designed into the next generation of 
systems. The result was Scantegrity [9], which retained 
hand-countable ballots, and was tested in a number of 
small elections. With Scantegrity, however, it was too 
easy to trigger an audit that would require scrutiny of the 
physical ballots. The Scantegrity II system [7, 8], de- 
ployed in Takoma Park, was a further refinement to ad- 
dress this problem by allowing a public statistical test of 
whether voter complaints actually reflect a discrepancy 
or whether they are without basis. Note: in the rest of 
the paper, “Scantegrity” refers to the voting team or to 
the Scantegrity II voting system; which one is meant is 
usually clear from context. 

As part of the Scantegrity agreement with Takoma 
Park (see section 3), a “mock election” [26] was held 
in April 2009 to test and demonstrate feasibility of the 
Scantegrity system during Takoma Park’s annual Arbor 
Day celebration. Volunteer voters voted for their favorite 
tree. A number of revisions and tweaks to the Scant- 
egrity system were made as a result of the mock elec- 
tion, including: ballot revisions (no detachable chit, but 
instead a separate voter verification card), pen revisions 
(two-ended, with different sized tips), scanner station re- 
visions (better voter flow, no monitor, two scanners), pri- 
vacy sleeve (no lock, no clipboard, folding design, feeds 
directly into scanner), and confirmation codes (three dec- 
imal digits). 


3 The Setting 


For several reasons, the implementation of voting sys- 
tems is a difficult task. Most voting system users— 
i.e. the voters—are untrained and elections happen infre- 
quently. Voter privacy requirements preclude the usual 
sorts of feedback and auditing methods common in other 
applications, such as banking. Also, government regula- 
tions and pre-existing norms in the conduct of elections 
are difficult to change. These issues can pose significant 
challenges when deploying new voting systems, and it 
is therefore useful to understand the setting in which the 
election took place. 


About Takoma Park The city of Takoma Park is lo- 
cated in Montgomery County, Maryland, shares a city 
line with Washington, D.C., and is governed by a mayor 
and a six-member City Council. The city has about 
17,000 residents² and almost 11,000 registered voters 
[27, pg. 10]. A seven-member Board of Elections con- 
ducts local elections in collaboration with the City Clerk. 
In the past, the city has used hand counts and optical scan 
voting, as well as DREs for state elections. 

The Montgomery County US Census Update Data 
of 2005 provides some demographic information about 
the city. Median household income in 2004 was 
$48,675. The percentage of households with comput- 
ers was 87.4%, and about 32% of Takoma Park residents 
above the age of twenty-five had a graduate, professional 
or doctoral degree. It is an ethnically diverse city: 45.8% 


²See http://www.takomaparkmd.gov/about.html. 
of its residents identify their race as “White,” 36.3% as 
“Black,” 9.7% as “Asian or Pacific Islander” and 8.2% as 
“Other” (individuals of Hispanic origin form the major 
component of this category). Further, 44.4% of its house- 
holds have a foreign-born head of household or spouse, 
and 44.8% of residents above the age of five spoke a lan- 
guage other than English at home. 


Instant Runoff Voting (IRV) Takoma Park has used 
IRV in municipal city elections since 2006. IRV is a 
ranked choice system where each voter assigns each can- 
didate a rank according to her preferences. The rules³ 
used by Takoma Park (and the Scantegrity software) for 
counting IRV ballots are relatively standard, so we omit 
further discussion for lack of space. 
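
For readers unfamiliar with IRV counting, the following is a minimal sketch of a generic instant-runoff count. It is not the Takoma Park charter procedure or the Scantegrity implementation; the function name `irv_winner`, the ballot representation (a list of candidate names in rank order), and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter


def irv_winner(ballots):
    """Return the winner of a generic instant-runoff count.

    ballots: list of lists, each giving candidate names in rank order.
    Returns None if no ballot expresses a usable preference.
    """
    remaining = {name for ballot in ballots for name in ballot}
    while remaining:
        # Count each ballot for its highest-ranked candidate still in the race.
        tally = Counter()
        for ballot in ballots:
            for choice in ballot:
                if choice in remaining:
                    tally[choice] += 1
                    break
        total = sum(tally.values())
        if total == 0:
            return None
        leader, votes = tally.most_common(1)[0]
        if votes * 2 > total:  # strict majority of continuing ballots
            return leader
        # Eliminate the weakest candidate. Ties are broken alphabetically,
        # which is an arbitrary choice for this sketch only.
        weakest = min(remaining, key=lambda name: (tally[name], name))
        remaining.remove(weakest)
    return None


example = [["A", "B"], ["A", "C"], ["B", "C"], ["C", "B"], ["C", "B"]]
print(irv_winner(example))
```

In the example, no candidate has a majority in the first round, B is eliminated, and B's ballot transfers to its second choice.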


Agreement with the City As with any municipal gov- 
ernment in the US, Takoma Park is allowed to choose its 
own voting system for city elections. For county, state, 
and federal elections, it is constrained by county, state, 
and federal election laws. 

Takoma Park and the SVST signed a Memorandum 
of Understanding (MOU), in which the SVST agreed 
to provide equipment, software, training assistance, and 
technical support. The City of Takoma Park agreed to 
provide election-related information on the municipality, 
election workers, consumable materials, and perform or 
provide all other election duties or materials not provided 
by us. No goods or funds were exchanged. 

According to the MOU, if approved by the city coun- 
cil, the election was to be conducted in compliance with 
all applicable laws and policies of the city. This included 
using Instant Runoff Voting as defined by the City of 
Takoma Park Municipal Charter. 

The SVST also agreed to pursue an accessible ballot- 
marking device for the election, but was later relieved of 
satisfying this requirement. Unfortunately, Scantegrity 
is not yet fitted with a voter interface for those with vi- 
sual or motor disabilities, and accessible user interfaces 
were also not used in Takoma Park’s previous optical 
scan elections. 


Timeline Scantegrity was approached by the Takoma 
Park Board of Elections in late February 2008, and, after 
considering other voting systems, the Board voted to rec- 
ommend a contract with Scantegrity in June 2008. Fol- 
lowing a public presentation to the City Council in July 
2008, the MOU was signed in late November 2008, about 
nine months after the initial contact. 


³For the exact laws used by Takoma Park, see page 22 of 
http://www.takomaparkmd.gov/code/pdf/charter.pdf. Sec- 
tion (f), concerning eliminating multiple candidates, was used in our 
implementation for tie-breaking only. 
The SVST held an open workshop in February 2009 to 
discuss the use of Scantegrity in both the mock and real 
elections. This workshop was held at the Takoma Park 
Community Center and was attended by Board of Elec- 
tion members, the City Clerk, current members (and a 
retired member) from the Montgomery County Board of 
Elections, as well as a representative each from the Pew 
Trust and Fair Vote. Following the mock election in April 
2009, the SVST proposed a redesigned system taking 
into consideration feedback from voters and poll work- 
ers (through surveys) and the Board of Elections. The 
Board voted to recommend use of the redesigned system 
in July 2009; this was made official in the city election 
ordinance in September 2009.⁴ Beginning around June 
2009, a meeting with representatives of the SVST was 
on the agenda of most monthly Board of Election meet- 
ings. Additionally, SVST members met many times with 
the City Clerk and the Chair of the Board of Elections to 
plan for the election. 

The final list of candidates was available approxi- 
mately a month before the election, on October 2. The 
Scantegrity meetings initializing the data and ballots 
were held in October (see Section 6), as was a final work- 
shop to test the system. Absentee ballots were sent out 
by the City Clerk in the middle of October. The SVST 
delivered ballots to the City Clerk in late October, and 
early voting began almost a week before the election, on 
October 28. Poll worker training sessions were held by 
the city on October 28 and 31, and polling took place on November 
3, 2009, from 7 am to 8 pm. The final Scantegrity audits 
were completed on 17 December 2009; all auditors were 
of the opinion that the election outcomes were correct 
(for details see section 6). 


4 Scantegrity Overview 


In this section, we give an overview of the Scantegrity 
system. For more detailed descriptions, see [7, 8]. 


Voter Experience At a high level, the voter experience 
is as follows. First, a voter checks in at the polling place 
and receives a Scantegrity ballot (see Figure 2) with a 
privacy sleeve. The privacy sleeve covers the ballot and 
keeps its contents private. Inside 
the voting booth, there is a special “decoder pen” and a 
stack of blank “voter verification cards.” The voter uses 
the decoder pen to mark the ballot. As on a conventional 
optical scan ballot, she fills in the bubble next to each of 
her selections. Marking a bubble with the decoder pen 
simultaneously leaves a dark mark inside the bubble and 


⁴See http://www.takomaparkmd.gov/clerk/agenda/items/2009/090809-3.pdf, 
section 2-D, page 2. 
reveals a previously hidden confirmation code printed in 
invisible ink. 

If the voter wishes to verify her vote later on the elec- 
tion website, she can copy her ballot ID and her revealed 
confirmation codes onto a voter verification card. She 
keeps the verification card for future reference. She then 
takes her ballot to the scanning station and feeds the bal- 
lot into an optical scanner, which reads the ballot ID and 
the marked bubbles. 

If a voter makes a mistake, she can ask a poll worker 
to replace her ballot with a new one. The first ballot is 
marked “spoiled,” and its ballot ID is added to the list of 
spoiled ballot IDs maintained by the election judges. 

The voter can verify her vote on the election website 
by checking that her revealed confirmation codes and 
ballot ID have been posted correctly. If she finds any 
discrepancy, the voter can file a complaint through the 
website, within a complaint period. When filing a com- 
plaint, the voter must provide the confirmation codes that 
were revealed on her ballot as evidence of the validity of 
the complaint. 


Ballots The Scantegrity ballot looks similar to a con- 
ventional optical scan ballot (see Figure 2 for a sam- 
ple ballot used in the election). It contains a list of the 
choices and bubbles beside each choice. Marking a bub- 
ble reveals a random 3-digit confirmation code. 


Confirmation Codes The confirmation codes are 
unique within each contest on each ballot, and are gener- 
ated independently and uniformly pseudorandomly. The 
confirmation code corresponding to any given choice on 
any given ballot is hidden and unknown to any voter until 
the voter marks the bubble for that choice. 
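
The uniqueness property described above can be sketched as follows. This is an illustrative assumption about how distinct three-digit codes might be drawn for one contest, not the Scantegrity generator: the function `codes_for_contest` and the contest names are hypothetical, and Python's `random.Random` stands in for the seeded cryptographic PRNG that a real system would need.

```python
import random


def codes_for_contest(rng, num_choices, digits=3):
    """Draw distinct fixed-width decimal codes, one per choice in a contest."""
    space = 10 ** digits
    # Sampling without replacement guarantees uniqueness within the contest.
    picks = rng.sample(range(space), num_choices)
    return [f"{p:0{digits}d}" for p in picks]


# Illustrative seed only; random.Random is NOT suitable for real elections.
rng = random.Random(2009)
contests = {"mayor": 3, "ward": 4}  # hypothetical contest sizes
ballot = {name: codes_for_contest(rng, n) for name, n in contests.items()}
print(ballot)
```

Codes are only required to be unique within a contest on a single ballot, so the same code may legitimately appear in different contests or on different ballots.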


Digital Audit Trail Prior to the election, a group of 
election trustees secret-share a seed to a pseudorandom 
number generator (PRNG). The trustees then input their 
shares to a trusted workstation to generate the pseudo- 
random confirmation codes for all ballots, as well as a 
set of tables of cryptographic commitments to form the 
digital audit trail. These tables allow individual voters to 
verify that their votes have been included in the tally, and 
allow any interested party to verify that the tally has been 
computed correctly, without revealing how any individ- 
ual voter voted. 
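
To illustrate what a cryptographic commitment in such a table does, here is a minimal hash-based sketch. The source does not specify the commitment scheme at this point, so the construction below (SHA-256 over a random nonce and the message) and the names `commit` and `verify` are assumptions chosen for clarity.

```python
import hashlib
import hmac
import secrets


def commit(message: bytes):
    """Return (commitment, nonce). Publish the commitment; keep the nonce secret."""
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + message).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: bytes, message: bytes) -> bool:
    """Check an opened commitment against the previously published digest."""
    expected = hashlib.sha256(nonce + message).hexdigest()
    return hmac.compare_digest(digest, expected)


# Hypothetical table entry linking a ballot's code to a candidate.
published, secret_nonce = commit(b"ballot 17, code 482 -> candidate 2")
print(verify(published, secret_nonce, b"ballot 17, code 482 -> candidate 2"))
```

Publishing only the digest before the election binds the authority to the entry without revealing it; opening it later (by revealing the nonce and message) lets anyone recompute the hash.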


Auditing After the election, any interested party can 
audit the election by using software to check the correct- 
ness of the data and final tally on the election website. 
Additionally, at the polling place on the day of the elec- 
tion, any interested party can choose to audit the printing 
of the ballots. A print audit consists of marking all of the 
bubbles on a ballot, and then either making a photocopy 
of the fully-marked ballot or copying down all of the re- 
vealed confirmation codes. The ballot ID is recorded by 
an election judge as audited. After the election, one can 
check that all of the confirmation codes on the audited 
ballot, and their correspondence with ballot choices, are 
posted correctly on the election website. 


5 Implementation 


The election required a cryptographic backend, a scan- 
ner, and a website. These 3 components form the ba- 
sic election system and their interaction is described in 
Figure 1. In addition, Takoma Park required software to 
resolve write-in candidate selections and produce a for- 
matted tally on election night. 

Scantegrity protects against manipulation of election 
results and maintains, but does not improve, the privacy 
properties of optical scan voting systems that use se- 
rial numbers. To compromise voter privacy using Scant- 
egrity features, an attacker must associate receipts to vot- 
ers and determine what confirmation numbers are as- 
sociated to each candidate. This is similar to violat- 
ing privacy by other means; for example, an attacker 
could compromise the scanner and determine the order 
in which voters used the device, or examine physical 
records and associate serial numbers to voters. The scan- 
ner and backend components protect voter privacy, but 
the website and the write-in candidate resolver do not 
because they work with public information only. 

Each component is written in Java. We describe the 
implementation and functions of each one in the follow- 
ing sections. 


Backend The cryptographic backend that provides the 
digital audit trail is a modified version of the Punchscan 
backend [21]. This backend is written in Java 1.5 using 
the BouncyCastle cryptography library.⁵ Key manage- 
ment in the Punchscan backend is handled by a simple 
threshold [25] cryptosystem that asks for a username and 
password from the election officials. 

We chose the Punchscan backend over newer propos- 
als [7] because it had already been implemented and 
tested in previous elections [13, 28]. At the interface be- 
tween the Scantegrity frontend and the Punchscan back- 
end, as described in [23], the permutations used by 
Punchscan are matched to a permutation of precomputed 
confirmation codes for Scantegrity that correspond to the 
permutation of codes printed on the ballot. 

The Punchscan backend uses a two-stage mix process 
based on cryptographic commitments published before 
the election. Each mix, the left mix and the right mix, 


⁵http://www.bouncycastle.org 
takes marked positions as input, shuffles the ballots, and 
reorders each marked position on each ballot according 
to a prescribed (pre-committed) permutation. The result 
is the set of cleartext votes, where position 0 corresponds 
to candidate 0, 1 to 1, etc. Between the two mixes, for 
example, position 0 may in fact correspond to candidate 
5, depending on the permutation in the right mix. 
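
The two-stage mix can be sketched in simplified form as follows. This is a toy model under stated assumptions, not the Punchscan backend: commitments are omitted, the helper names (`invert`, `compose`, `run_mix`) are hypothetical, and the row identifier carried through the mixes stands in for the secret linkage that the real backend keeps hidden.

```python
import random


def invert(perm):
    """Return the inverse of a permutation given as a list."""
    inv = [0] * len(perm)
    for i, v in enumerate(perm):
        inv[v] = i
    return inv


def compose(outer, inner):
    """Return the permutation i -> outer[inner[i]]."""
    return [outer[inner[i]] for i in range(len(inner))]


def run_mix(rows, perms, rng):
    """Reorder each row's mark by that row's permutation, then shuffle the rows."""
    out = [(row_id, perms[row_id][mark]) for row_id, mark in rows]
    rng.shuffle(out)
    return out


rng = random.Random(7)  # illustrative seed, not a cryptographic generator
votes = [0, 2, 1, 2]    # intended candidate index per ballot
# printed[b][p] is the candidate shown at position p on ballot b.
printed = [rng.sample(range(3), 3) for _ in votes]
marked = [(b, printed[b].index(v)) for b, v in enumerate(votes)]
# Split each ballot's printed order into a left and a right permutation.
left = [rng.sample(range(3), 3) for _ in votes]
right = [compose(printed[b], invert(left[b])) for b in range(len(votes))]

middle = run_mix(marked, left, rng)   # between the mixes, positions look scrambled
clear = run_mix(middle, right, rng)   # after the right mix, position == candidate
tally = sorted(candidate for _, candidate in clear)
print(tally)
```

Because `right[b]` is built as the printed order composed with the inverse of `left[b]`, applying both mixes recovers each ballot's candidate, while the intermediate values alone do not.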

The Punchscan backend partitions [22] each contest 
such that each contest is treated as an independent elec- 
tion with a separate set of commitments. In the case of 
Takoma Park, each ward race and the mayor’s race are 
treated as separate elections. (The announcement of sep- 
arate mayoral race vote counts for each ward is required 
by Takoma Park). The scanner is responsible for creating 
the input files for each individual election. 

Election officials hold a series of meetings using the 
backend to conduct an election. Before the election, dur- 
ing Meeting 1 (Initialization), they choose passwords that 
are shares of a master key that generates all other data for 
the election in a deterministic fashion. After each meet- 
ing, secret data (such as the mapping from confirmation 
codes to candidates) is erased from the hard drive and re- 
generated from the passwords when it is needed again. 
In Meeting 1, the backend software creates a digital au- 
dit trail by committing to the Punchscan representation 
of candidate choices and to the mixset: the left and right 
mix operations for each ballot. Later, during Meeting 2 
(Pre-Election Audit), the backend software responds to 
an audit of the trail demonstrating that the mixset de- 
crypts ballots correctly. At this time, the backend also 
commits to the Scantegrity front-end, consisting of the 
linkage between the Scantegrity front-end and its Punch- 
scan backend used for decryption. 

After the election, election officials run Meeting 3 (Re- 
sults), publishing the election results and the voted con- 
firmation numbers. For the purposes of the tally audit, 
the system also publishes the outputs of the left and right 
mixes. In Meeting 4 (Post-Election Audit), officials re- 
spond to the challenges of the tally computation audit. 
Either the entire left mix or the entire right mix opera- 
tions are revealed, and the auditor checks them against 
data published in Meeting 3. 

The Meeting 4 audit catches, with probability one half, 
a voting system that cheats in the tally computation. To 
provide higher confidence in the results, the backend cre- 
ates multiple sets of left and right mixes; in Takoma Park, 
we created 40 sets for each election, 20 of which were 
audited. Given 2 contests per ballot and 40 sets of left 
and right mixes, there are a total of 160 commitments 
per ballot in the audit trail, in addition to a commitment 
per contestant per ballot for each confirmation number 
(15-18, depending on the Ward). 
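
The figures in this paragraph can be checked with a short calculation. The commitment count follows directly from the text; the escape probability is a sketch under two assumptions that the text does not state explicitly: that a cheating backend would have to be inconsistent in every mix set, and that each of the 20 audited sets detects this independently with probability one half.

```python
# Commitments to mix operations per ballot, as stated in the text.
contests_per_ballot = 2
mix_sets = 40
mixes_per_set = 2  # one left mix and one right mix
mix_commitments = contests_per_ballot * mix_sets * mixes_per_set
print(mix_commitments)  # 160

# Assumed model: each audited set independently catches cheating with
# probability 1/2, so cheating survives all audits with probability 2^-n.
audited_sets = 20
escape_probability = 0.5 ** audited_sets
print(escape_probability)  # roughly one in a million
```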

The implementation uses two classes of “random” 
number sources. The first is used to generate the dig- 
ital audit trail, and the second is used for auditing the 
trail. Both types of sources must be unpredictable to an 
adversary, and we describe each in turn. 

Figure 1: Election Workflow. The core election workflow in Scantegrity is similar to an optical scan election: 
a software backend creates ballot images that are printed, used by voters, and scanned. The results are fed to the 
backend which creates the tally. The audit capacity is provided by 3 extra steps: (1) create the initial digital audit trail 
and audit a portion of it, (2) audit the ballots to ensure correctness when printing, and (3) audit the final tally. 

Digital Audit Trail The Punchscan backend generates 
the mixes and commitments using entropy provided by 
each election official during initialization of the thresh- 
old encryption. This provided a “seed” for a pseudo- 
random number generator (based on the SHA256 hash 
function). 
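
A hash-based generator of this general kind can be sketched as below. The text says only that the generator is based on SHA-256 and seeded from the officials' entropy, so the counter-mode construction, the class name `HashPrng`, and the helper `seed_from_shares` are assumptions for illustration rather than the backend's actual design.

```python
import hashlib


def seed_from_shares(shares):
    """Combine per-official entropy (bytes) into one seed; hypothetical helper."""
    h = hashlib.sha256()
    for share in shares:
        # Hash each share first so share boundaries cannot be confused.
        h.update(hashlib.sha256(share).digest())
    return h.digest()


class HashPrng:
    """Deterministic generator: block i is SHA-256(seed || counter i)."""

    def __init__(self, seed: bytes):
        self.seed = seed
        self.counter = 0

    def next_block(self) -> bytes:
        block = hashlib.sha256(self.seed + self.counter.to_bytes(8, "big")).digest()
        self.counter += 1
        return block

    def randbelow(self, n: int) -> int:
        """Uniform integer in [0, n), using rejection sampling to avoid modulo bias."""
        limit = (1 << 64) - ((1 << 64) % n)
        while True:
            value = int.from_bytes(self.next_block()[:8], "big")
            if value < limit:
                return value % n


prng = HashPrng(seed_from_shares([b"official-1 entropy", b"official-2 entropy"]))
print([prng.randbelow(1000) for _ in range(5)])
```

Because the output depends only on the seed, the same data can be erased and regenerated later from the same inputs.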

We also used this random source to generate the con- 
firmation numbers when changing the Punchscan back- 
end to support Scantegrity. Unfortunately, we introduced 
an error in the generation when switching from alphanu- 
meric to numeric confirmation numbers as a result of 
findings in the mock election (see Section 2). This re- 
sulted in approximately 8.5 bits of entropy as opposed to 
the expected 10 bits. We discovered this error after we 
started printing and it was too late to regenerate the audit 
trail. 

The error increased the chance that an adversary could 
guess an unseen confirmation code to approximately one 
in 360 rather than the intended one in 1000; a small de- 
crease in the protection afforded against malicious voters 
trying to guess unseen codes in order to discredit the sys- 
tem. 
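
The relationship between the guessing odds and the entropy figures quoted above can be checked directly, assuming codes are uniformly distributed over the effective code space (the nature of the generation error itself is not described here, so nothing about it is modeled).

```python
import math

intended_bits = math.log2(1000)  # three uniformly random decimal digits
observed_bits = math.log2(360)   # effective space implied by a 1-in-360 guess

print(round(intended_bits, 2))
print(round(observed_bits, 2))
```

The first value is just under 10 bits and the second is about 8.5 bits, matching the figures in the text.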

Auditing Random numbers are needed to generate 
challenges for the various auditing steps (print audit, ran- 
domized partial checking). These numbers should be un- 
predictable in advance to an adversary. They should also 
be “verifiable” after the fact as having come from a “truly 
random” source that is not manipulable by an adversary. 

We chose to use the closing prices of the stocks in 
the Dow Jones Industrial Average as our verifiable but 
unpredictable source to seed the pseudorandom number 
generator (the use of stock prices for this purpose was 
first described in [11]). These prices are sufficiently un- 
predictable for our purposes, yet verifiable after the fact. 
However, it turns out that post-closing “adjustments” can 
sometimes be made to the closing prices, which can 
make these prices less than ideal for our purposes in 
terms of verifiability. 
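
One way such a publicly verifiable seed could be derived and used is sketched below. The canonical string format, the ticker symbols, the prices, and the functions `seed_from_prices` and `pick_audit_sides` are all hypothetical; the text does not describe how the prices were encoded or how challenges were derived from the seed.

```python
import hashlib


def seed_from_prices(prices):
    """Hash closing prices into a seed.

    prices: mapping of ticker -> closing price as a published string.
    Tickers are sorted so that anyone can rebuild the same byte string.
    """
    canonical = ";".join(f"{ticker}={prices[ticker]}" for ticker in sorted(prices))
    return hashlib.sha256(canonical.encode("ascii")).digest()


def pick_audit_sides(seed, num_sets):
    """Derive one left/right challenge per mix set from the seed."""
    sides = []
    for i in range(num_sets):
        first_byte = hashlib.sha256(seed + i.to_bytes(4, "big")).digest()[0]
        sides.append("left" if first_byte & 1 == 0 else "right")
    return sides


# Made-up tickers and prices, for illustration only.
closing = {"AAA": "10.50", "BBB": "20.25", "CCC": "7.81"}
seed = seed_from_prices(closing)
print(pick_audit_sides(seed, 5))
```

Using price strings exactly as published matters for verifiability: if a closing price is adjusted afterwards, a verifier who fetches the adjusted value would compute a different seed, which is the weakness noted above.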


Scanner Software The original intent of Scantegrity 
was to build on top of an existing optical scan system. 
There was no pre-existing optical scan system in use at 
Takoma Park, so we implemented a simple system using 
EeePC 900 netbooks and Fujitsu 6140 scanners. 


The scanning software is written in Java 1.6. It uses a 
bash shell script to call the SANE scanimage program⁶ 
and polls a directory on the filesystem to acquire bal- 
lot images. Once an image is acquired, it uses circular 
alignment marks to adjust the image, reads the barcode 
using the ZXing QRCode Library,⁷ and uses a simple 
threshold algorithm to determine if a mark is made on 
the ballot. 
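
A simple threshold test for mark detection might look like the following sketch. The text does not give the scanner's actual rule or parameters, so the function `is_marked`, the grayscale cutoff of 128, and the 30% fill fraction are assumptions for illustration.

```python
def is_marked(bubble_pixels, dark_level=128, fill_fraction=0.3):
    """Decide whether a bubble region counts as marked.

    bubble_pixels: 2-D list of grayscale values (0 = black, 255 = white)
    cropped to one bubble after the image has been aligned.
    """
    flat = [p for row in bubble_pixels for p in row]
    if not flat:
        return False
    dark = sum(1 for p in flat if p < dark_level)
    # The bubble is marked when enough of its area is darker than the cutoff.
    return dark / len(flat) >= fill_fraction


blank = [[255] * 10 for _ in range(10)]
filled = [[20] * 10 for _ in range(10)]
print(is_marked(blank), is_marked(filled))
```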


Individual races on each ballot are identified by ward 
information in the barcode, which is non-sequential and 
randomly generated. The ballot id in the barcode and 
the web verification numbers on each ballot are different 
numbers, and the association between each number type 
is protected by the backend system. Write-in candidate 
areas, if that candidate is selected by the voter, are stored 
as clipped raw images with the ballot scan results. Ballot 
scan results are stored in a random location in a memory 
mapped file. 


The current implementation of the scanning software 
does not protect data in transit to the backend, which 
poses a risk for denial of service. Checking of the cor- 
rectness of the scanner is done through the Scantegrity 
audit. The data produced by the scanner does not com- 
promise voter privacy, but—assuming an attacker could 
intercept scanner data—voter privacy could be compro- 
mised at the scanner through unique write-in candidates 
on the ballot, through a compromised scanner, by bugs 
in the implementation, or by relying on the voter to make 
readable copies of the barcode to get a ballot id. 


⁶http://www.sane-project.org/ 
⁷http://code.google.com/p/zxing/ 
Tabulator/Write-In Software At the request of 
Takoma Park we created an additional piece of software, 
the Election Resolution Manager (ERM), that allows 
election judges to manually determine for each write-in 
vote what candidate the vote should be counted toward. 
The other responsibility of the ERM is to act as part of 
the backend. It collates data from each scanner and pre- 
pares the input files for the backend. 


To resolve write-ins with this software, the user cy- 
cles through each image, and either types in the name of 
the intended candidate or selects the name from a list of 
previously identified candidates composed of the original 
candidates and any previously typed candidate names. 
The user is not shown the whole ballot, so he does not 
know what the other selections are on that ballot, or what 
rank the write-in was given. We call this process resolv- 
ing a vote because the original vote is changed from the 
generic “Write-In” candidate to the candidate that was 
intended by the voter. The ERM produces a PDF of 
each image, the candidate selection for that image, and 
a unique number to identify the selection. 

Scantegrity handles write-in candidates just like other 
optical scan systems by treating the write-in position 
as a candidate. Therefore, the backend does not know 
how each write-in position was resolved, and two results 
records are created: one with write-in resolution pro- 
vided by the ERM, and one without write-in resolution 
provided by the backend. 


To check the additional record generated by the ERM, 
an observer reduces the resolved results record and veri- 
fies that the set of resolved ballots is the same as the set of 
unresolved ballots. To audit that the judges chose the cor- 
rect candidates for each write-in, the observer refers to 
the PDF generated during write-in resolution. The PDF 
allows the observer to reference each resolved ballot en- 
try in the resolved results file and verify that the image 
was properly transcribed. 
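
The consistency check described above can be sketched as follows. This is an assumed model, not the observers' actual tooling: each ballot is represented as a tuple of ranked names, any name that is not a printed candidate is reduced to the generic write-in, and the two records are compared as multisets. The candidate and write-in names are placeholders.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

WRITE_IN = "Write-In"
Ballot = Tuple[str, ...]


def reduce_ballot(ballot: Ballot, printed: Set[str]) -> Ballot:
    """Map every name that is not a printed candidate back to the generic write-in."""
    return tuple(name if name in printed else WRITE_IN for name in ballot)


def records_agree(resolved: Iterable[Ballot], unresolved: Iterable[Ballot],
                  printed: Set[str]) -> bool:
    """True when the reduced resolved record equals the unresolved record as a multiset."""
    reduced = Counter(reduce_ballot(b, printed) for b in resolved)
    return reduced == Counter(unresolved)


printed = {"Candidate A", "Candidate B"}  # hypothetical printed candidates
unresolved = [("Candidate B", WRITE_IN), (WRITE_IN,), ("Candidate A",)]
resolved = [("Candidate B", "Pat Example"), ("Sam Sample",), ("Candidate A",)]
tampered = [("Candidate B", "Pat Example"), ("Candidate B",), ("Candidate A",)]

print(records_agree(resolved, unresolved, printed))  # True
print(records_agree(tampered, unresolved, printed))  # False
```

Such a check only shows that no vote was moved between printed candidates and the write-in position; whether each image was transcribed correctly still requires the PDF review.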

One caveat of this approach is that if a write-in candi- 
date wins, a malicious authority could modify these im- 
ages to change results, but could not deny that the write- 
in position had received a winning number of votes. This 
situation would require additional procedures to verify 
the write-ins (e.g. a hand count, and/or careful audit of 
the transcriptions by each judge). 


Website Beyond communicating the election outcome
itself, the role of the election website is to serve as a
“bulletin board” (BB) to broadcast the cryptographic audit
data set (i.e., cryptographic commitments, responses to
audit challenges, etc.). In addition, voters can use this
website to check their receipts, and file a dispute if the
receipt is misreported. We provided an implementation
with these features written in Java 1.6. It used the Stripes
Framework⁸ and an Apache Derby database backend.⁹
In practice, we only used part of this implementation.

Originally, our plan was to have Takoma Park host the 
website, but officials chose a hybrid approach where they 
hosted election information and results. That website 
would link to our server to provide a receipt checking 
tool and audit data. After the election, officials would 
provide us with a copy of the public data files to pub- 
lish. This decision caused a number of changes to our 
approach. 

We decided to only use the receipt checking code from
the implementation, and, to make downloading more
convenient for auditors, post all election data on our publicly
available subversion repository.¹⁰ Additionally,
both auditors agreed to mirror the data.

A primary security requirement for the Scantegrity
BB is to provide authenticated broadcast communication
from election officials to the public. We met this requirement
with digital signatures. A team member (Carback)
created signed copies of each file with gnupg¹¹ using the
private key corresponding to his public key from May 28, 2009.

Without authenticated communication, it would be im- 
possible to prove if different results were provided to dif- 
ferent people. Our specific approach to the website re- 
quires observers to verify signatures and check with each 
other if they receive identical copies of the data (and ver- 
ify the consistency of the signatures over time). Our au- 
ditors, Adida and Zagorski, performed these actions, but 
we do not know the extent of this communication other- 
wise. As usual with our approach to Scantegrity, we are 
enabling detection of errors (genuine or malicious). 
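
The cross-check between observers can be sketched in Python as a comparison of file digests. This example assumes that each observer has already verified the signatures (for example with `gpg --verify`) and holds the downloaded files in memory; the function names and the choice of SHA-256 are illustrative, not part of the procedure the auditors followed.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Hex SHA-256 digest that observers can exchange out of band."""
    return hashlib.sha256(data).hexdigest()


def inconsistent_files(mine: Dict[str, bytes], theirs: Dict[str, bytes]) -> List[str]:
    """Names of files that two observers did not receive identically."""
    names = set(mine) | set(theirs)
    return sorted(
        name for name in names
        if name not in mine or name not in theirs
        or digest(mine[name]) != digest(theirs[name])
    )


# Hypothetical downloads by two observers.
observer_a = {"codes.xml": b"ward1:ABC", "tally.xml": b"tally-v1"}
observer_b = {"codes.xml": b"ward1:ABC", "tally.xml": b"tally-v2"}
print(inconsistent_files(observer_a, observer_b))  # ['tally.xml']
```

Any non-empty result would be evidence that different data was served to different people, which is exactly the situation the signatures are meant to make provable.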

There are several potential threats to the bulletin board 
model—we will briefly enumerate some of them. At a 
high level, threats pertain primarily to misreporting of 
results, or to voter identification. With regard to results 
reporting, an adversary may attempt to misreport results 
by substituting actual election data with false data. In 
the event that all parties verify signatures of information 
they receive, and check consistency with the signed files, 
incorrect confirmation codes on the bulletin board would 
be detected by voters, and incorrect computation of the 
tally by anyone checking the tally computation audit. If 
the voter checking confirmation codes does not check 
consistency with the rest of the bulletin board (by, for ex- 
ample, downloading the bulletin board data, checking all 
the signatures and checking that his or her confirmation 
code is also correctly noted in the entire bulletin board 
data) he or she may be deceived into believing their bal- 
lot was accurately recorded and counted. Similarly, if 


⁸ http://www.stripesframework.org/
⁹ http://db.apache.org/derby/
¹⁰ http://scantegrity.org/svn/data/takoma-nov3-2009/
¹¹ http://www.gnupg.org/
the various signatures are not cross-checked across individuals
or observed over time, an adversary may replace
the confirmation codes after they have been checked, or 
send different ones to voters and to auditors. An adver- 
sary may also attempt an identification attack, whereby 
the objective is to link voter identities with receipt data, 
such as by recording IP addresses of voters who check 
their receipts. 


6 The Election 


In this section, we describe the election in chronological
order.


6.1 Preparations 


Preparations for the election included running the first two
backend meetings and creating the ballot.


Independent Auditors The Board of Elections requested
cryptographers Dr. Ben Adida (Center for Research
on Computation and Society, Harvard University)
and Dr. Filip Zagórski (Institute of Mathematics and
Computer Science, Wroclaw University of Technology,
Poland) to perform independent audits of the digital data
published by Scantegrity in general, and of the tally
computation in particular. Dr. Adida¹² and Dr. Zagórski¹³
maintained websites describing the audits and the results
of the audits, and Dr. Adida also blogged the audit.¹⁴
Before the election, Dr. Adida pointed out several
instances where the Scantegrity information was insufficient;
Scantegrity documentation was updated as a result.

The Board of Elections also requested Ms. Lillie 
Coney (Associate Director, Electronic Privacy Informa- 
tion Center and Public Policy Coordinator for the Na- 
tional Committee for Voting Integrity (NCVI)) to per- 
form print audits on Election Day. Ms. Coney chose 
ballots at random through the day, exposed the confir- 
mation codes for all options on the ballot, and kept these 
with her until after the end of the complaint period, when 
Scantegrity opened commitments to all unvoted and un- 
spoiled ballots (and hence to all ballots she had audited). 
Ms. Coney then checked that the correspondence be- 
tween codes and confirmation numbers on her ballots 
matched those on the website. 

Both tasks, of print audits and digital data audits, can 
be performed by voters. Digital data audits can also be 
performed by any observers. In future elections, when 
the general population and Takoma Park voters are more 


¹² http://sites.google.com/site/takomapark2009audit/

¹³ http://zagorski.im.pwr.wroc.pl/scantegrity/

¹⁴ http://benlog.com/articles/category/takoma-park-2009/
familiar with end-to-end elections, it is anticipated that
voters (and, in particular, candidate representatives) will 
perform such audits. 


Meeting 1 Four election officials (the City Clerk, the
Chair, Vice Chair and a member of the Board of Elections:
Jessie Carpenter, Anne Sergeant, Barrie Hofmann
and Jane Johnson, respectively) were established as election
trustees in Meeting 1, held on October 12, 2009.

It was explained to the trustees that, through their passwords,
they would generate the confirmation codes and
share the secret used to tally election results. Further,
it was explained that, without at least a threshold number
of passwords, the election could not be tallied by Scantegrity,
and that if a threshold number of passwords was
not accessible (if they were forgotten, for example, or
trustees were unavailable due to sickness) the only available
counts would be manual counts. A threshold of two
trustees was determined based on anticipated availability
of the officials, and it was explained that two trustees
could collude to determine the correspondence between
confirmation numbers and codes, and hence that each
trustee should keep her password secret.
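
To illustrate what a threshold of two out of four trustees means, the sketch below uses Shamir's secret sharing over a prime field. This is an assumption made for illustration: the text above does not say which sharing scheme Scantegrity uses or how shares are derived from passwords, and the function names and field size are ours.

```python
import secrets
from typing import List, Tuple

PRIME = 2**127 - 1  # Mersenne prime used as the field for this toy example
Share = Tuple[int, int]


def split_secret(secret: int, threshold: int, shares: int) -> List[Share]:
    """Split secret into points on a random polynomial of degree threshold - 1."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must lie in the field")
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: List[Share]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 (needs threshold points)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


# Four trustees, any two of whom can reconstruct the tallying secret.
trustee_shares = split_secret(123456789, threshold=2, shares=4)
print(recover_secret(trustee_shares[:2]) == 123456789)
```

With a threshold of two, any pair of trustees suffices for tallying, which is also why any pair could collude, as the trustees were told.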

The trustees generated commitments to the decryption 
paths for each of 5000 ballots per ward (for six wards). 
Scantegrity published the commitments on October 13,
2009 at 12:13 am.
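
For readers unfamiliar with commitments, the following Python sketch shows a generic hash-based commitment: the committer publishes a digest now and reveals the value and a random nonce later. The construction (SHA-256 over nonce and value) and the function names are assumptions for illustration and are not taken from the Scantegrity implementation.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def commit(value: bytes) -> Tuple[bytes, bytes]:
    """Return (commitment, nonce); publish the commitment, keep the nonce secret."""
    nonce = secrets.token_bytes(32)
    return hashlib.sha256(nonce + value).digest(), nonce


def verify_opening(commitment: bytes, value: bytes, nonce: bytes) -> bool:
    """Check that a revealed (value, nonce) pair matches the published commitment."""
    return hmac.compare_digest(commitment, hashlib.sha256(nonce + value).digest())


# Hypothetical decryption-path entry for one ballot.
published, opening = commit(b"ballot 17 -> row 4021")
print(verify_opening(published, b"ballot 17 -> row 4021", opening))  # True
print(verify_opening(published, b"ballot 17 -> row 9999", opening))  # False
```

Publishing the digest before the election binds the trustees to the committed data, while the random nonce keeps the data hidden until it is opened.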


Meeting 2 In Meeting 2, held on October 14, 2009,
trustees used Scantegrity-written code to respond to challenges
generated using stock market data at closing on
October 14. Half of the ballot decryption paths committed
to in Meeting 1 were opened. Additionally, trustees
constructed ballots (associations between candidates and 
confirmation codes) at this meeting, and generated com- 
mitments to them. Scantegrity published the stock mar- 
ket data, the challenges, and the responses. 
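
One way to turn public data into an unpredictable but reproducible challenge is sketched below. The derivation shown (ranking items by a SHA-256 digest seeded with the public data and opening the lower half) is an assumption for illustration, not the procedure Scantegrity used; the input string is a placeholder rather than real market data.

```python
import hashlib
from typing import List


def challenge_half(public_data: bytes, item_count: int) -> List[int]:
    """Deterministically choose half of the items to open, seeded by public data."""
    seed = hashlib.sha256(public_data).digest()

    def rank(index: int) -> bytes:
        return hashlib.sha256(seed + index.to_bytes(8, "big")).digest()

    ranked = sorted(range(item_count), key=rank)
    return sorted(ranked[:item_count // 2])


# Placeholder for the closing prices; anyone with the same data gets the same set.
opened = challenge_half(b"closing prices 2009-10-14 (placeholder)", 10)
print(opened)
```

Because the seed is fixed only after the commitments are published, the committer cannot know in advance which half will be opened, and any observer can recompute the selection.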


Ballot Design The ballot used for the 2009 election 
was based on ballots used for the 2007 election. We 
made the conscious choice to modify (as little as pos- 
sible) a design already used successfully in a past elec- 
tion, and not to use the ballot we had designed for the 
mock election. The main reason for reusing the ballot 
design was that it would be familiar to voters. The ballot 
was required to contain instructions in both English and 
Spanish: marking instructions, instructions for write-ins, 
instructions for IRV and any Scantegrity-related instruc- 
tions (see Figure 2). 


Printing Ballots We use “invisible” ink to print the
marking positions that reveal confirmation codes to voters.
We used refillable inkjet cartridges in multiple color
positions of an Epson R280 printer to print confirmation
codes. The ink is not actually invisible, but looks like
a yellow bubble before marking and a dark bubble with
light yellow codes after marking.


[Figure 2 image: annotated ballot. Callouts: tear-off line; alignment mark; ward number; stub number; reactive ink, darkens when marked with pen; 2D machine-readable bar code; online verification number, for voter to look up online.]

Figure 2: An unmarked Takoma Park 2009 ballot for Ward 1 showing instructions in Spanish and English, the options,
the circular alignment marks, the 2D barcode, the ballot serial number (on the stub, meant for poll workers to keep
track of the number of ballots used) and the online verification number (for voters to check their codes). The true ballot
was printed on legal size paper and was hence larger than shown.

We initially began printing with 6 printers, but they 
proved unreliable. It was our expectation that using large 
amounts of commodity hardware would scale, but it did 
not. We did not anticipate the number of failure modes 
we experienced and our printing process was delayed by 
approximately one and a half days.


Ballot Delivery Mail-in (absentee) ballots were delivered
to the City Clerk on October 16. Early, in-person
voting ballots were delivered on October 27 for early vot- 
ing on October 28, and all other ballots a couple of days 
later on October 30. 

Absentee ballots were identical to in-person voting 
ballots except they did not contain online verification 
numbers and voters were not given any instructions on 
checking confirmation numbers online. They were re- 
turned by mail in double envelopes and scanned with 
the early votes. Confirmation numbers for these ballots 
were, however, made available online after scanning, so 
that there was no distinction in published data between 
absentee and in-person voted ballots. 

The board decided to issue ballots without confirma- 
tion numbers due to the small number of anticipated ab- 
sentee votes and the costs associated with mailing ballots 
with special pens. Mailing the ballots with confirmation 
codes would allow verification of confirmation codes, but 
opens up new attacks: the possibility of false charges of 
election fraud by adversaries who might expose confir- 
mation codes and reprint ballots, or use expensive equip- 
ment to attempt to determine the invisible codes. Strong 
verification for absentee ballots is an ongoing research
subject within the Scantegrity team. 

Early in-person voters used Scantegrity ballots with 
all Scantegrity functionality, except that the early votes 
were scanned in after the polls closed on Election Day, 
and not by voters themselves. Voters were, however, 
provided verification cards and could check confirmation 
codes for these ballots online. 


Poll Worker Training Several training sessions were 
held in the weeks prior to the election. Manuals from the 
previous election were updated and a companion guide 
was created with Scantegrity-specific instructions. Elec- 
tion judges were given these two manuals, and a member 
from our team demonstrated the voting process at one 
session. 


¹⁵ See http://scantegrity.org/~carback1l/ink for
more information on the printing process.
Voter Education Voter education for this election focused
on online verification. Articles in the City newspaper
before the real election indicated that voters could
check confirmation numbers online; this was also announced
on the city’s election website.¹⁶


Scanner Setup We attempted to minimize, not prevent,¹⁷
the potential for using the wrong software by
installing our software on top of Ubuntu Linux on SD
flash cards, setting the “read-only” switch on each card,
and setting up the software to read and write to USB
sticks. We fingerprinted the first card after testing with
the sha1sum utility and cloned it to a second card for
the other netbook. Each netbook was set to boot from
the card and BIOS configuration was locked with a password.

Both flash cards were checked with the sha1sum utility,
then each was placed into its netbook, which was placed into
a lock box and delivered to Takoma Park. The USB sticks were
initialized with scanner configuration files. We uniquely 
identified each scanner by changing the ScannerID field 
in the configuration files, then we placed the correspond- 
ing USB sticks (3 for each netbook) into the lock box. 

Upon delivery of the scanners the day before the elec- 
tion, we gave election officials the lock box keys and 
showed them how to open the lock boxes. We confirmed 
with election officials the contents of each box and the 
officials verified, with our assistance, that the USB memory
sticks did not contain any ballot data by looking at
the configuration file and making sure the ballot data file
was blank.¹⁸ To protect against virus infection on the
sticks we set them to read-only for this procedure.


6.2 Election Day 


On Election Day, November 3, 2009, polls were open 
from 7 am to 8 pm at a single polling location, the 
Takoma Park Community Center. Several members of 
the SVST were present through most of the day in the 
building in case of technical difficulty. One SVST mem- 
ber was permitted in the polling room at most times as an 
observer, and a couple of SVST members were present 
in the vestibule giving out and collecting survey forms 
through most of the day. Lillie Coney of the Electronic 
Privacy Information Center, who performed a print audit 
on the request of the Board of Elections, was present in 
the polling room through a large part of the day. 


¹⁶ http://www.takomaparkmd.gov/clerk/election/2009/

¹⁷ Scantegrity would detect manipulation at the scanner. A better
solution would use trusted hardware technology (e.g. a TPM [14]).

¹⁸ These were the only 2 files on the disk at this time. Additionally,
election officials did not check fingerprints on the flash cards. Since no
3rd party had reviewed the code or fingerprinted it they relied on our
chain of custody.
Starting the Election The scanner was the only SVST
equipment to set up and it was a turnkey system. Election
judges needed to plug in the USB sticks and power
on the netbooks. The scanner was attached to a scanning
apparatus, and cables were run into the lockbox that
contained the netbook. When ready, the scanner would
beep 3 times. After reading a ballot, the scanner would
beep once. During shutdown, the scanner would beep
another 3 times. If there was any failure, the scanner
would beep continuously or not beep at all.

Election judges set up the check-in tables, pollbooks, 
and voting booths. The election started on time. 


Voting The election proceeded quite smoothly, with 
very few (small) glitches. An SVST member was able 
to assist polling officials in fixing a problem with their 
poll books (not provided by Scantegrity). Voters had 
some initial problems with the use of the scanner and 
the privacy sleeve, some seeking assistance from elec- 
tion judges who also had difficulty. After an explanation 
to the election judges by the Chair of the Board of Elec- 
tions, the use of the scanner was considerably smoother. 
With a few ballots, the privacy sleeve was not letting 
go of the ballots; one ballot was mangled considerably 
but scanned fine. Seventeen scanned ballots had lines on 
them that caused the scanner to be unable to read votes, 
and one ballot had alignment marks manipulated such 
that it was also unreadable. Images of all unreadable
scans were saved, so we were able to enter these votes
manually. Many of the seventeen ballots had a
line in the same location, which is consistent with
a foreign substance on a ballot put into the scanner.
These problems did not affect our ability to count
the votes. 

During the day, Ms. Coney chose about fifty ballots at 
random, uniformly distributed across wards, and exposed 
the confirmation codes for all options for the ballots. A 
copy of each ballot was made for her to take with her; 
the copies were signed by the Chair of the BoE. Neither 
Ms. Coney nor SVST members had any interaction with 
voters. 

Towards the end of the day, after the local NPR sta- 
tion carried clips from an interview with the Chair of the 
Board of Elections and a voter, the polling station saw a 
large increase in the number of voters, with the line tak- 
ing up much of the floor outside the polling room. The 
SVST prepared to print more ballots, but this was not re- 
quired. The number of printed ballots ended up being 
almost twice the number of voted ballots. 

Absentee and early voted ballots were scanned in af- 
ter the closing of polls. Afterward, the scanners were 
shut down. The chief judge opened each lock box, set 
all sticks to read only, removed 2 USB sticks (leaving 
the third with the scanning netbook), and locked the lock 
box. Our team was given one stick for the ERM system.
The other was kept by the city. 

In Meeting 3a, trustees used Scantegrity code to gen- 
erate results without provisional ballots at about 10 pm. 
The Chair of the Board of Elections announced the results
to those present at the polling place at the time (including 
candidates, their representatives, voters, etc.); this was 
also televised live by the local TV station. Confirma- 
tion codes and the election day tally were posted on the 
Scantegrity website. 


6.3 After the Election 


On the next day, around 2 pm, results including verified 
provisional ballots were published. Takoma Park rep- 
resentatives had announced a tally without provisional 
ballots the night before, and followed with the tally that 
included verified provisionals in accordance with stan- 
dard Takoma Park procedures. The final Meeting 3 re- 
sults were published on November 4th just before mid- 
night. 

The number of registered voters was 10,934, and 1,728
votes were cast (15.8%). The city-certified final tally for
each contest is provided in Table 1. In each race, a majority
was won after tallying voters’ first choices.
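
The sketch below shows a simplified instant-runoff (IRV) tally and why no elimination round was needed in the mayoral race. It is not the city's tabulation code: the majority is computed over ballots that still rank a remaining candidate, ties for last place are broken alphabetically (an arbitrary choice), and all write-ins are treated as a single candidate. The first-choice counts are the certified mayoral totals reported in this section.

```python
from collections import Counter
from typing import Optional, Sequence


def irv_winner(ballots: Sequence[Sequence[str]]) -> Optional[str]:
    """Eliminate the last-place candidate until someone holds a majority."""
    remaining = {name for ballot in ballots for name in ballot}
    while remaining:
        tops = Counter()
        for ballot in ballots:
            for name in ballot:  # highest-ranked candidate still in the race
                if name in remaining:
                    tops[name] += 1
                    break
        active = sum(tops.values())
        if active == 0:
            return None
        leader, votes = tops.most_common(1)[0]
        if 2 * votes > active:
            return leader
        # Arbitrary tie-break for this sketch: fewest votes, then name order.
        loser = min(remaining, key=lambda n: (tops[n], n))
        remaining.discard(loser)
    return None


# Certified first choices for Mayor: a first-round majority (1000 of 1681).
mayor = ([("Williams",)] * 1000 + [("Schlegel",)] * 664
         + [("Write-ins",)] * 17)
print(irv_winner(mayor))  # Williams

# A toy race that does need a runoff round: C is eliminated, B then wins 5 to 4.
toy = [("A",)] * 4 + [("B",)] * 3 + [("C", "B")] * 2
print(irv_winner(toy))  # B
```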


Hand Count and Certification Following a hand 
count performed by representatives from both the SVST 
and Takoma Park, the Chair of the Board of Elections 
certified the results of the hand count to the City Council 
at 7 pm on November 5. The hand count and the Scant- 
egrity count differed because officials were able to better 
determine voter intent during the hand count. For exam- 
ple, in the mayoral race, the scanner count determined 
that 646 votes were cast for candidate Schlegel, 972 for 
Williams, 15 for various write-in candidates, and 90 were 
not cast. The certified hand count totals were 664 votes 
for Schlegel, 1000 for Williams and 17 for write-in can- 
didates. Thus 48 of a total of 1681 votes in this race 
would not have been counted by a scanner count alone. 
The discrepancy was caused by voters marking ballots 
outside of the designated marking areas. Such marks, 
while not read by the scanner by definition, are consid- 
ered valid votes by Takoma Park law. Similarly, 8 of a 
total of 447 votes for Ward 1 council member, 8 of 251
for Ward 2, 16 of 431 for Ward 3, 10 of 210 for Ward 4, 
2 of 81 for Ward 5 and 11 of 199 for Ward 6 were added 
to scanner vote totals after hand counting. 
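
The arithmetic behind these figures can be checked directly. The numbers in the following Python snippet are the ones reported above; the variable names and the percentage calculation are ours.

```python
# Mayoral race: scanner count versus certified hand count.
scanner = {"Schlegel": 646, "Williams": 972, "Write-ins": 15}
hand = {"Schlegel": 664, "Williams": 1000, "Write-ins": 17}

added = {name: hand[name] - scanner[name] for name in hand}
total_added = sum(added.values())
total_hand = sum(hand.values())
print(added)                    # {'Schlegel': 18, 'Williams': 28, 'Write-ins': 2}
print(total_added, total_hand)  # 48 1681

# Council races: (votes added by the hand count, certified total) per ward.
wards = {1: (8, 447), 2: (8, 251), 3: (16, 431),
         4: (10, 210), 5: (2, 81), 6: (11, 199)}
for ward, (extra, total) in wards.items():
    print(f"Ward {ward}: {extra} of {total} ({100 * extra / total:.1f}%)")
```

In every race the share of votes recovered only by the hand count is a few percent of the certified total.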


Post-Election Audit During Meeting 4, held on
November 6 at 6 p.m., trustees used Scantegrity-written
code to reveal all codes on voted ballots, and to reveal everything
for all the ballots that were not spoiled or voted
upon. Trustees also responded to pseudo-random challenges
generated by stock market results at closing on
November 6. Scantegrity published all data on November
7th around 9am. While the SVST could have chosen
to use closing data on an earlier date, such as November
4 or November 5, which could have been more stable,
the team chose to stick to its earlier-announced plan (of
using the freshest stock market data) for the sake of consistency.


Mayor                  Votes
Roger B. Schlegel        664
Bruce Williams          1000
Write-ins                 17

Ward Councilor
Ward 1    Josh Wright; Write-ins
Ward 2    Colleen Clay; Write-ins
Ward 3    Dan Robinson; Write-ins
Ward 4    Terry Seamens; Eric Mendoza; Write-ins
Ward 5    Reuben Snipper; Write-ins
Ward 6    Navid Nasr; Fred Schultz; Write-ins

Table 1: City certified election results for the Mayor’s race and each City Councilman’s race.

On November 9, 2009, Dr. Adida and Dr. Zagorski 
independently confirmed that Scantegrity correctly re- 
sponded to all digital challenges. In particular, they 
confirmed that the tally computation audit data was cor- 
rect. Both made available independently-written code on 
their websites that voters and others could use to check 
the tally computation commitments. The Chair of the 
BoE mentions that several voters have shown an interest 
in running the code made available by Drs. Adida and 
Zagorski, and that she expects that Takoma Park voters 
will use the code to perform some audits themselves in 
the next few months. 


Confirmation Codes and Complaints The period for 
complaints regarding the election (including complaints 
about missing confirmation codes) expired at 6 pm on 
November 6. The Scantegrity website recorded 81 
unique ballot ID verifications, of which about 66 (almost 
4% of the total votes) were performed before the dead- 
line. The SVST was also told by a BoE member that 
at least a few voters checked codes on auditor websites. 
Both Dr. Adida and Dr. Zagorski made the confirmation 
codes available on their websites after the election. 

The number of voters who checked their ballots on- 
line before the Takoma Park complaint deadline (66), 
while not large, was sufficient to have detected (with 
high probability) any errors or fraud large enough to have 
changed the election outcome. (Detailed calculations 
omitted here; these calculations are not so simple, due 
to the use of IRV.) 
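
A simplified version of such a calculation, ignoring the IRV complications mentioned above, treats the voters who check as a uniform random sample of all ballots and asks how likely it is that at least one of them holds an altered ballot. The model, the function name, and the example of 168 altered ballots (roughly half of the 336-vote first-choice margin in the mayoral race) are our assumptions, not the authors' omitted calculation.

```python
from math import comb


def detection_probability(total: int, checked: int, altered: int) -> float:
    """P(at least one checked ballot is among the altered ones), hypergeometric model."""
    if checked > total - altered:
        return 1.0  # more checkers than unaltered ballots: detection is certain
    return 1 - comb(total - altered, checked) / comb(total, checked)


# 1728 ballots cast, 66 checked before the deadline, 168 hypothetically altered.
p = detection_probability(total=1728, checked=66, altered=168)
print(f"{p:.4f}")
```

Under these assumptions the chance that all 66 checkers miss every altered ballot is well below one percent, which is consistent with the claim of detection with high probability.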

Scantegrity received a single complaint by a voter who 
had trouble deciphering a digit in the code and noted it 
as “0,” while the Scantegrity website presented it as “8.”
The voter requested that codes be printed more clearly in 
the future. He also stated that if he were not a trusting 
individual, he would believe that he had proof that his 
vote was altered. 

All codes for all voted ballots were revealed after 
the dispute resolution period, and all commitments ver- 
ified by two independent auditors, Dr. Adida and Dr. 
Zagorski. Hence, the probability that the code was in er- 
ror is very small, albeit non-zero. Scantegrity does not 
believe the code was in error, and there were no other 
complaints regarding confirmation numbers. 


Print Audits Dr. Zagórski provided an interface allowing
Ms. Coney to check the commitments
opened by Scantegrity in Meeting 4 against the 
candidate/confirmation-code correspondence on the bal- 
lots she audited. In her report [12], she confirmed that 
the correspondence between confirmation numbers and 
candidates on all the printed ballots audited by her was 
correctly provided by the interface. 


Followup The Board of Elections and an SVST rep- 
resentative met to discuss the election and opportunities 
for improvement several weeks after the election. Both 
sides were largely satisfied with the election. Conversa- 
tions have begun regarding the use of Scantegrity in the 
next municipal election at Takoma Park, to be held in 
November 2011. No decisions have been taken. 


7 Surveys and Observations of Voter Expe- 
riences 


To understand the experiences of voters and poll workers, 
we timed some of the voters as they voted, asked voters 
and poll workers to fill out two questionnaires, and in- 
formally solicited comments from voters as they left the 
precinct building. Approved by the Board of Elections 
and UMBC’s Institutional Review Board, our procedures 
respected the constraint of not interfering with the election
process. This section summarizes the results of our
observations and surveys. 


Timing Data Sitting unobtrusively as official ob- 
servers in a designated area of the polling room for part 
of the day, two helpers (not members of the Scantegrity 
team) timed 93 voters as they carried out the voting pro- 
cess. Using stopwatches, they measured the number of 
seconds that transpired from the time the voter received 
a ballot to the time the voter began walking away from 
the scanner. 

Voting times ranged from 55 secs. to 10 mins. (the
second longest time was 385 secs.), with a mean of 167 
secs. and a median of 150 secs. On average, voters who 
appeared older took longer than voters who appeared 
younger. Most of the time was spent marking the bal- 
lot. The average time to vote was significantly faster 
than during the April 2009 mock election, when voters 
took approximately 8 mins. on average due primarily to 
scanning delays [26]. 

The observers noted that many voters did not fully use 
the privacy sleeve as intended, removing the ballot before 
scanning rather than inserting the privacy sleeve with the 
ballot into the scanning slot. Two of the 93 observed 
voters initially inserted the privacy sleeve upside-down, 
causing the ballot not to be fed into the scanner (even 
though the scanner could read the ballot in any orienta- 
tion). A few ran into difficulties trying to insert the sleeve 
with one hand while holding something else in the other 
hand. 


Election Day Comments From Voters As voters left
the precinct building, members of the Scantegrity team
conducting the written surveys and a helper (a usability
expert who is not a member of the Scantegrity team)
solicited comments from voters with questions like, “What
did you think of the new voting system?” The helper
solicited comments 1:30-3:00pm and 7-8pm. A common
response was, “It was easy.”

Quite a few voters did not understand that they could 
verify their votes on-line and that, to do so, they had 
to write down the codenumbers revealed by their bal- 
lot choices. Some explained that they intentionally did 
not read any instructions because they “knew how to 
vote.” Others failed to notice or understand instructions 
on posters along the waiting line, in the voting booth, on 
the ballot, and in the Takoma Park Newsletter. 

In response, later in the day, we announced to voters 
as they entered the building that there is a new system; 
to verify your vote, write down the codenumbers. These 
verbal announcements seemed to have some positive ef- 
fect, and there were fewer voter comments expressing 
lack of awareness of the verification option after we be- 
gan the announcements. Nevertheless, some voters still 
were unaware of the verification option. It was a humbling 
experience to see first-hand how difficult it can be 
to get across the most basic points effectively, especially 
the first time a new system is used. 

Some of the voters complained about the double- 
ended pen, not knowing which end to use, or having trou- 
ble writing in candidates with the chisel-point (the nar- 
row point was intended for write-ins). A small number 
of voters had difficulty seeing the codenumbers, perhaps 
largely because repeatedly pressing too hard could erode 
the paper. A few voters expressed concern about the dif- 
ficulty of writing down the codenumbers, had the ballot 
been much longer or had there been a large number of 
competing candidates. 

Many voters expressed a strong confidence in the in- 
tegrity of elections, while a small minority expressed 
sharp distrust in previous electronic election technology. 
These feelings seemed to be based more on a general 
subjective belief rather than on detailed knowledge of 
election procedures and technology. Similarly, those ex- 
pressing strong confidence in Scantegrity seemed to like 
the concept of verification but did not understand in de- 
tail why Scantegrity provides high outcome assurance. 


Survey of Voter Experiences As voters were leaving 
the precinct, we invited them to fill out two one-sided 
survey forms: a field-study questionnaire, and a demo- 
graphics questionnaire. The field study asked voters 
about the voting system they just used, with most an- 
swers expressed on a seven-point Likert scale. The last 
question invited voters to make any additional sugges- 
tions or comments. Each pair of forms had matching 
serial numbers to permit correlation of the field study 
responses with demographics. 271 voters filled out the 
forms. 

Fifty-one voters wrote comments on the question- 
naires, often pointing out confusion about various as- 
pects of the process but with no consistent theme. (1) 
Some were unaware of the verification option. (2) Some did 
not realize they were supposed to write down codenum- 
bers. (3) Some found the pens confusing to use: they 
did not realize that the pens would expose codenumbers, 
and they did not know which end to use. (4) Some found 
codenumbers were hard to read. (5) Some did not under- 
stand how to mark an IRV ballot. (6) Some did not know 
how to place the ballot into the scanner. (7) One had no 
difficulty but wondered if seniors or people who speak 
neither English nor Spanish might have difficulties. (8) 
One wondered if the government might be able to discern 
his vote by linking his IP address used during verification 
with his ballot serial number and noting the time that he 
was issued a ballot (this may be possible if the cryptogra- 
phy is broken or in other scenarios, but it would be more 
direct to have the scanner log how he voted). (9) Many 
suggested that it would have been helpful to have better 
instructions, including instruction while they wait in line. 

Figure 3 shows how voters responded to four ques- 
tions from the field study questionnaire. These results 
strongly show that voters found the voting system easy 
to use (Question 5), and that they had confidence in the 
system (Question 13). Question 10 showed that the op- 
tion to check votes on line increased voter confidence in 
the election results. Question 9 showed that voters had 
confidence that the receipt alone did not reveal how they 
voted; this finding is notable given that it is widely sus- 
pected that many people erroneously believe that all E2E 
receipts reveal ballot choices. We plan to present detailed 
analysis of our complete survey data in a separate com- 
panion paper. 


Survey of Poll Worker Experiences Each of the 
twelve poll workers was given an addressed and stamped 
envelope with two questionnaires (field study and demo- 
graphics) to fill out and mail to the researchers after the 
election. The field study focused on their experiences administering 
Scantegrity, with most answers expressed on 
a seven-point Likert scale. This questionnaire also in- 
cluded four open-ended questions. Each pair of forms 
had matching serial numbers. Five forms were returned. 

Poll workers noted the following difficulties. (1) There 
was too much information. (2) Some voters did not un- 
derstand what to do, including how to create a receipt. 
(3) Some voters did not understand how to mark an IRV 
ballot. (4) The privacy sleeve was hard to use with one 
hand. (5) The double-ended pens created confusion. (6) 
Voters, poll workers, and the Scantegrity team have dif- 
ferent needs. One wondered if Scantegrity was worth the 
extra trouble. 

They offered the following suggestions: (1) Simplify 
the ballot. (2) Provide receipts so that voters do not have 
to copy codenumbers. (3) Develop better pre-election 
voter education. 


8 Discussion and Lessons Learned 


Overall, this project should be deemed a success: the 
goals of the election were met, and there were no ma- 
jor snafus. Many aspects of the Scantegrity design and 
implementation worked well, while some could be im- 
proved in future elections. 


Technology Challenges Perhaps the most challenging 
aspect for future elections is scaling up ballot printing. 
The printers we used were not very reliable. 

Variations on the Scantegrity design worth exploring 
include the printing of voter receipts (rather than hav- 
ing voters copy confirmation codes by hand)—there are 
clearly security aspects to handle if one does this. The 
design should also be extended for better accessibility. 
The special pen might be improved by having only a sin- 
gle medium-tip point, rather than two tips of different 
sizes. The scanning operation and its interaction with 
the privacy sleeve should be studied and improved. 

The website, while sufficient, might utilize existing re- 
search in distributed systems to reduce the expectations 
on observers and voters. The scanner could also be improved 
with more sophisticated image analysis and with 
better handling of unreadable ballots. It only occurred to 
us after the election that the write-in resolution process 
could have greater utility if it were expanded to deal with 
unreadable and unclear ballots. 


Real World Deployment of Research Systems As is 
common with many projects, too much was left until 
the last minute. Better project management would have 
been helpful, and key aspects should have been finalized 
earlier. Materials and procedures should be more exten- 
sively tested beforehand. 

One of the most important lessons learned is the 
value of close collaboration and clear communication be- 
tween election officials and the election system providers 
(whether they be researchers or vendors). 

Another lesson learned is that it is important both to 
provide voters with clear explanations of the new features 
of a voting system and to do so efficiently, with 
minimal impact on throughput. Resolving the tension 
between these requirements definitely needs further ex- 
ploration. For example, it might be worthwhile to have 
an instructional video explaining the Scantegrity system 
that voters could watch as they come in. The permanent 
adoption of Scantegrity II in a jurisdiction would, how- 
ever, alleviate the educational burden over time, as voters 
learn the system’s features in successive elections. 


Comparison with post-election audits It is interest- 
ing to compare Scantegrity with the other major tech- 
nique for election outcome verification: post-election au- 
dits. Because these audits do not allow anyone to check 
that a particular ballot was counted correctly, they do 
not provide the level of integrity guarantee provided by 
Scantegrity. 

Post-election audits, even those with redundant digital 
and physical records like optical scan systems, only address 
errors or malfeasance in the counting of votes and 
not in the chain of custody.* In contrast, end-to-end 
voting systems such as Scantegrity provide a “verifiable 
chain of custody.” Voters can check that their ballots are 
included in the tally, and anyone (not just a privileged 
group of auditors) can check that those ballots are 
tallied as intended. 

* Having multiple records may make an attacker’s job harder, but 
note that the attacker only has to change the record that will ultimately 
be used and/or trusted (not necessarily both). Also, redundancy can work 
against a system, as changing a digital record in an obviously malicious 
way may allow time for a more subtle manipulation of the physical 
record. 

Figure 3: Voter responses to Survey Questions 5, 9, 10, 13 from all 271 voters completing the survey. Using a seven- 
point Likert scale, voters indicated how strongly they agreed or disagreed with each statement about the voting system 
they had just used (1 = strongly disagree, 7 = strongly agree). Each histogram shows the number of voters responding 
for each of the seven agreement levels. The four questions shown are the following: (5) Overall, the voting system was 
easy to use. (9) I have confidence that my receipt by itself does not reveal how I voted. (10) The option to verify my 
vote online afterwards increases my confidence in the election results. (13) I have confidence in this voting system. 

It must be admitted, however, that the additional in- 
tegrity benefits provided by Scantegrity II come at the 
cost of somewhat increased complexity and at the cost 
of an increased (but manageable) risk to voter privacy 
(since ballots are uniquely identifiable). That said, some 
jurisdictions and/or election systems require or use serial 
numbers on ballots anyway, and we have proposed sev- 
eral possible approaches to appropriately destroy or ob- 
fuscate serial number information. Furthermore, it can 
be argued that a voter wishing to “fingerprint” a ballot 
can do so without being detected in current paper ballot 
systems simply by marking ovals in distinctive ways. 


9 Conclusions 


Traditional opscan voting systems have the clear bene- 
fit that “votes are verifiably cast as intended”—the voter 
can see for herself that the ballot is correctly filled out. 
Yet once her ballot is cast, the voter must place her trust 
in others that ballots are safely collected and correctly 
counted. With end-to-end voting systems these last two 
operations (collecting ballots and counting them) are ver- 
ifiable as well: voters can verify—using their receipt and 
a website—that their ballot is safely collected with the 
others, and anyone can use the website data to verify that 
the ballots have been correctly counted. The Scantegrity 
II voting system provides such end-to-end verification 
capability as an overlay on top of traditional opscan tech- 
nology. Further development should improve scalability 
(esp. printing), usability (e.g. with printed receipts) and 
accessibility of the Scantegrity II system. 

The successful use of the Scantegrity II voting sys- 
tem in the Takoma Park election of November 3, 2009 
demonstrates that voters and election officials can use so- 
phisticated cryptographic techniques to organize a trans- 
parent secret ballot election with a familiar voting experi- 
ence. The election results show considerable satisfaction 
by both voters and pollworkers, indicating that end-to- 
end voting technology has matured to the point of being 
ready and usable for real binding governmental elections. 
This paper thus documents a significant step forward in 
the security and integrity of voting systems as used in 
practice. 
Abstract 


We examine the problem of acoustic emanations of print- 
ers. We present a novel attack that recovers what a dot- 
matrix printer processing English text is printing based 
on a record of the sound it makes, if the microphone is 
close enough to the printer. In our experiments, the at- 
tack recovers up to 72 % of printed words, and up to 
95 % if we assume contextual knowledge about the text, 
with a microphone at a distance of 10 cm from the printer. 
After an upfront training phase, the attack is fully auto- 
mated and uses a combination of machine learning, au- 
dio processing, and speech recognition techniques, in- 
cluding spectrum features, Hidden Markov Models and 
linear classification; moreover, it allows for feedback- 
based incremental learning. We evaluate the effective- 
ness of countermeasures, and we describe how we suc- 
cessfully mounted the attack in-field (with appropriate 
privacy protections) in a doctor’s practice to recover the 
content of medical prescriptions. 


1 Introduction 


Information leakage caused by emanations from elec- 
tronic devices has been a topic of concern for a long 
time. The first publicly known attack of this type, pub- 
lished in 1985, reconstructed the monitor’s content from 
its electromagnetic emanation [36]. The military had 
prior knowledge of similar techniques [41, 20]. Related 
techniques captured the monitor’s content from the ema- 
nations of the cable connecting the monitor and the com- 
puter [21], and acoustic emanations of keyboards were 
exploited to reveal the pressed key [3, 42, 7]. In this work 
we examine the problem of acoustic emanations of dot- 
matrix printers. 


Dot matrix printers? Didn’t these printers vanish in 
the 80s already? Although indeed outdated for private 
use, dot-matrix printers continue to play a surprisingly 
prominent role in businesses where confidential information 
is processed. We commissioned a representative survey 
from a professional survey institute [26] in Germany 
on this topic, with the following major lessons learned 
(Figure 1 contains additional information from this survey): 


• About 60 % of all doctors in Germany use dot-matrix 
printers for printing patients’ health records, medical 
prescriptions, etc. This corresponds to about 190,000 
doctors and more than 2.4 million records and 
prescriptions printed on average per day. 

• About 30 % of all banks in Germany use dot-matrix 
printers for printing account statements, transcripts 
of transactions, etc. This corresponds to 14,000 
bank branches and more than 1.2 million such 
documents printed on average per day. 

• Only about 5 % of these doctors and about 8 % 
of these banks currently plan to replace their dot-matrix 
printers. The reasons for the continued use of dot-matrix 
printers are manifold: robustness, cheap deployment, 
incompatibility of modern printers with old hardware, and, 
overall, the lack of a business reason that would compel 
IT laymen to modernize working IT hardware. 

• Several European countries (e.g., Germany, 
Switzerland, and Austria) require by law the use 
of dot-matrix (carbon-copy) printers for printing 
prescriptions of narcotic substances [8]. 


1.1 Our contributions 


We show that printed English text can be successfully 
reconstructed from a previously taken recording of the 
sound emitted by the printer. The fundamental reason 
why the reconstruction of the printed text works is that, 
intuitively, the emitted sound becomes louder if more 
needles strike the paper at a given time (see Figure 2 for 
a typical setting of 24 needles at the printhead). We 
verified this intuition and we found that there is a correlation 
between the number of needles and the intensity of 
the acoustic emanation (see Figure 3). We first conduct a 
training phase where words from a dictionary are printed, 
and characteristic sound features of these words are 
extracted and stored in a database. We then use the trained 
characteristic features to recognize the printed English 
text. (Training and recognition on a letter basis, similar 
to [42], seems more appealing at first glance since it 
naturally comprises the whole vocabulary. However, the 
emitted sound is strongly blurred across adjacent letters, 
rendering a letter-based approach much poorer than the 
word-based approach, even if spell-checking is used, see 
below.) 


DOCTORS (n=541 asked) 
  Use dot-matrix printers                      58.4 % 
    - for general prescriptions                79.4 % 
    - for other usages                         84.5 % 
  Printer placed in proximity of patients      72.2 % 
  Replacement planned                           4.7 % 

BANKS (n=524 asked) 
  Use dot-matrix printers                      30.0 % 
    - for bank statement printers              29.9 % 
    - for other usages                         83.4 % 
  Printer placed in proximity of customers     83.4 % 
  Replacement planned                           8.3 % 

Figure 1: Main results of the survey on the usage of dot-matrix printers in doctor’s practices and banks [26]. Other 
printer usages reported in the survey comprise: “certificate of incapacity for work, transferal to another doctor, 
hospitalization, and receipts” for doctors, and “account book, PIN numbers for online banking, supporting documents, 
ATMs” for banks. 


Figure 2: Print-head of an Epson LQ-300+II dot-matrix 
printer, showing the two rows of needles. 


This task is not trivial. Major challenges include: 
(i) Identifying and extracting sound features that suitably 
capture the acoustic emanation of dot-matrix printers; 
(ii) Compensating for the blurred and overlapping 
features that are induced by the substantial decay time of 
the emanations; (iii) Identifying and eliminating wrongly 
recognized words to increase the overall percentage of 
correctly identified words (recognition rate). 
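
The recognition rate mentioned in challenge (iii) can be illustrated with a minimal sketch. It assumes that the printed and the recognized text have already been split into the same number of words, so that they can be compared position by position; the function name `recognition_rate` is ours and does not come from the original work.

```python
def recognition_rate(printed_words, recognized_words):
    """Percentage of positions at which the recognized word equals the printed word.

    Assumption: both sequences are aligned word by word (same length).
    """
    if len(printed_words) != len(recognized_words):
        raise ValueError("sequences must contain the same number of words")
    if not printed_words:
        return 0.0
    correct = sum(p == r for p, r in zip(printed_words, recognized_words))
    return 100.0 * correct / len(printed_words)


rate = recognition_rate(["the", "patient", "needs", "rest"],
                        ["the", "patent", "needs", "rest"])
```
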
Overview of the approach. Our work addresses these 
challenges, using a combination of machine learning 
techniques for audio processing and higher-level infor- 
mation about document coherence. Similar techniques 
are used in language technology applications, in particu- 
lar in automatic speech recognition. 


First, we develop a novel feature design that borrows 
from commonly used techniques for feature extraction in 
speech recognition and music processing. These tech- 
niques are geared towards the human ear, which is lim- 
ited to approx. 20 kHz and whose sensitivity is logarith- 
mic in the frequency; for printers, our experiments show 
that most interesting features occur above 20 kHz, and a 
logarithmic scale cannot be assumed. Our feature design 
reflects these observations by employing a sub-band de- 
composition that places emphasis on the high frequen- 
cies, and spreading filter frequencies linearly over the 
frequency range. We further add suitable smoothing to 
make the recognition robust against measurement varia- 
tions and environmental noise. 
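
The feature design described above might be sketched as follows. This is an illustration under our own assumptions rather than the filter bank used in the experiments: we assume a 96 kHz sampling rate, seven bands spread linearly between 20 kHz and 48 kHz, simple second-order band-pass filters, and a moving average of the rectified signal as smoothing. All function names and parameter values are hypothetical.

```python
import math


def bandpass_coeffs(center_hz, bandwidth_hz, sample_rate_hz):
    """Second-order band-pass filter with unit gain at the center frequency."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    q = center_hz / bandwidth_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)                    # feed-forward part
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)    # feedback part
    return b, a


def apply_biquad(samples, b, a):
    """Filter the samples with a direct-form-I biquad."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def smooth(values, window):
    """Moving average of the rectified signal (a crude envelope)."""
    out, acc = [], 0.0
    for i, v in enumerate(values):
        acc += abs(v)
        if i >= window:
            acc -= abs(values[i - window])
        out.append(acc / min(i + 1, window))
    return out


def subband_features(samples, sample_rate_hz=96000, low_hz=20000,
                     high_hz=48000, bands=7, window=32):
    """One smoothed envelope per band; band centers are spaced linearly."""
    width = (high_hz - low_hz) / bands
    features = []
    for k in range(bands):
        center = low_hz + (k + 0.5) * width
        b, a = bandpass_coeffs(center, width, sample_rate_hz)
        features.append(smooth(apply_biquad(samples, b, a), window))
    return features


# A synthetic 30 kHz tone should mostly excite the band centered at 30 kHz.
tone = [math.sin(2.0 * math.pi * 30000 * n / 96000) for n in range(2000)]
features = subband_features(tone)
```

Spacing the band centers linearly, rather than on a logarithmic or mel-like scale, reflects the observation that a logarithmic scale cannot be assumed for printer sounds.
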


Second, we deal with the decay time and the induced 
blurring by resorting to a word-based approach instead of 
decoding individual letters. A word-based approach re- 
quires additional upfront effort such as an extended train- 
ing phase (as a word-based dictionary is larger), and it 
does not permit us to increase recognition rates by us- 
ing, e.g., spell-checking. Recognition of words based on 
training the sound of individual letters (or pairs/triples of 
letters), however, is infeasible because the sound emitted 
by printers blurs too strongly over adjacent letters. (Even 
words that differ considerably on the letter basis may 
yield highly similar overall sound features, which com- 
plicates the subsequent post-processing, see below.) This 
complication was not present in earlier work on acous- 
tic emanations of keyboards, since the time between two 
consecutive keystrokes is always large enough that blur- 
ring was not an issue [42]. 


Third, we employ speech recognition techniques to increase 
the recognition rate: we use Hidden Markov Models 
(HMMs) that rely on the statistical frequency of 
sequences of words in English text in order to rule out 
incorrect word combinations. The presence of strong blurring, 
however, requires the use of at least 3-grams on the 
words of the dictionary to be effective, causing existing 
implementations for this task to fail because of memory 
exhaustion. To tame memory consumption, we implemented 
a delayed computation of the transition matrix 
that underlies HMMs, and in each step of the search 
procedure, we adaptively removed the words with only 
weakly matching features from the search space. 


Figure 3: Graph showing the correlation between the 
number of needles striking the ribbon and the measured 
acoustic intensity. 
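
The two memory-saving ideas can be illustrated with a small Viterbi-style search. This is a simplified sketch and not the implementation used in our prototype: for brevity it uses a 2-gram (bigram) language model instead of 3-grams, it computes transition scores on demand and caches them instead of building a full transition matrix, and it prunes each position to a fixed number of best-matching candidates. The function names, the `beam` parameter, and all probabilities in the example are made up for illustration.

```python
import math


def viterbi_decode(candidates, transition_logprob, beam=3):
    """Return the most likely word sequence.

    candidates: one dict per printed word, mapping candidate word -> acoustic log-score.
    transition_logprob(prev, word): language-model log-score, computed on demand.
    beam: number of best acoustic candidates kept per position (pruning).
    """
    cache = {}

    def trans(prev, word):
        # Delayed computation: only transitions that are actually visited are scored.
        key = (prev, word)
        if key not in cache:
            cache[key] = transition_logprob(prev, word)
        return cache[key]

    def prune(scores):
        # Drop weakly matching candidates from the search space.
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return dict(best[:beam])

    paths = {w: (s, [w]) for w, s in prune(candidates[0]).items()}
    for position in candidates[1:]:
        new_paths = {}
        for word, acoustic in prune(position).items():
            score, path = max(
                ((prev_score + trans(prev, word), prev_path)
                 for prev, (prev_score, prev_path) in paths.items()),
                key=lambda item: item[0],
            )
            new_paths[word] = (score + acoustic, path + [word])
        paths = new_paths
    return max(paths.values(), key=lambda item: item[0])[1]


# Toy example: the acoustic scores slightly prefer "of" at the second position,
# but the language model knows that "such as" is far more likely than "such of".
bigram = {("such", "as"): 0.6, ("such", "of"): 0.01,
          ("as", "the"): 0.5, ("of", "the"): 0.5}


def toy_language_model(prev, word):
    return math.log(bigram.get((prev, word), 1e-4))


observed = [
    {"such": math.log(0.9), "much": math.log(0.1)},
    {"of": math.log(0.55), "as": math.log(0.45)},
    {"the": math.log(0.8), "she": math.log(0.2)},
]
decoded = viterbi_decode(observed, toy_language_model)
```
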


Experiments, underlying assumptions and limita- 
tions. Before we describe our experiments, let us be 
clear about the underlying assumptions that render our 
approach possible. (i) The microphone (or bug) has 
to be (surreptitiously) placed in close proximity (about 
10 cm) of the printer. (ii) Because our approach is word-based 
for the reasons described above, it will only identify 
words that have been previously trained; feedback-based 
incremental training of additional words is possible. 
While this is less of a concern for, e.g., recovering 
general English text and medical prescriptions, it renders 
the attack currently infeasible against passwords or PIN 
numbers. In the bank scenario, the approach can still be 
used to identify, e.g., the sender, recipient, or subject of a 
transaction. (iii) Conducting the learning phase requires 
access to a dot matrix printer of the same model. There is 
no need to get hold of the actual printer at which the tar- 
get text was printed. (iv) If HMM-based post-processing 
is used, a corpus of (suitable) text documents is required 
to build up the underlying language model. Such post- 
processing is not always necessary, e.g., our in-field at- 
tack in a doctor’s practice described below did not exploit 
HMMs to recover medical prescriptions. 

We have built a prototypical implementation that can 
bootstrap the recognition routine from a database of 
featured words that have been trained using supervised 
learning. We applied this implementation to four different 
English text documents, using a dictionary of about 
1,400 words (including the 1,000 most frequently used 
English words and the words that additionally occur in 
these documents, see the second assumption above) and a 
general-purpose corpus extracted from stable Wikipedia 
articles that the HMM-based post-processing relies upon. 
The prototype automatically recognizes these texts with 
recognition rates of up to 72 %. To investigate the 
impact of HMM-based post-processing with a domain- 
specific corpus instead of a general-purpose corpus on 
the recognition rate, we considered two additional docu- 
ments from a privacy-sensitive domain: living-will dec- 
larations. We used publicly available living-will dec- 
larations to extract a specialized corpus, thereby also 
increasing the dictionary to 2,150 words. Our proto- 
type automatically recognized the two target declarations 
with recognition rates of about 64 % using the general- 
purpose corpus, and increased the recognition rates to 
72 % and 95 %, respectively, using the domain-specific 
corpus. This shows that, somewhat expectedly, HMM- 
based post-processing is particularly worthwhile if prior 
knowledge about the domain of the target document can 
be assumed. 

We have identified and evaluated countermeasures that 
prevent this kind of attack. We found that fairly simple 
countermeasures such as acoustic shielding and ensur- 
ing a greater distance between the microphone and the 
printer suffice for most practical purposes. 

Furthermore, we have successfully mounted the at- 
tack in-field in a doctor’s practice to recover the con- 
tent of medical prescriptions. (For privacy reasons, we 
asked for permission upfront and let the secretary print 
fresh prescriptions of an artificial client.) The attack was 
observer-blind and conducted under realistic — and ar- 
guably even pessimistic — circumstances: during rush 
hour, with many people chatting in the waiting room. 


1.2 Related work 


Military organizations investigated compromising ema- 
nations for many years. Some of the results have been de- 
classified: the Germans spied on the French field phone 
lines in World War I [6], the Japanese spied on Amer- 
ican cipher machines using electromagnetic emanations 
in 1962 [1], the British spied on acoustic emanation of 
(mechanical) Hagelin encryption devices in the Egyptian 
embassy in 1956 [39, p. 82], and the British spied on par- 
asitic signals leaked by the French encryption machines 
in the 1950s [39, p. 109f]. 

The first publicly known attack we are aware of was 
published in 1985, and exploited electromagnetic radiation 
of CRT monitors [36, 16]. Since then, various 
forms of emanations have been exploited. Electromagnetic 
emanations that constitute a security threat to computer 
equipment result from poorly shielded RS-232 serial 
lines [35], keyboards [2], as well as the digital cable 
connecting modern LCD monitors [21]. We refer to [22] 
for a discussion of the security limits for electromagnetic 
emanation. The time-varying diffuse reflections of the 
light emitted by a CRT monitor can be exploited to recover 
the original monitor image [19]; compromising reflections 
were studied in [5, 4]. Information leaking from 
status LEDs was studied in [25]. 


Figure 4: Overview of the attack. 

Acoustic emanations were shown to divulge text typed 
on ordinary keyboards [3, 42, 7], as well as information 
about the CPU state and the instructions that are exe- 
cuted [33]. Acoustic emanations of printers were briefly 
mentioned before [10]; it was solely demonstrated that 
the letters “W” and “J” can be distinguished. This study 
did not determine whether any other letters can be dis- 
tinguished, let alone if a whole text can be reconstructed 
by inspection of the recording, or even in an automated 
manner. 

Several techniques from audio processing are adapted 
for use in our system. A central technique is feature ex- 
traction. We use features based on sub-band decompo- 
sition [27]. Alternative feature designs are based on the 
(Short-time) Fast Fourier Transform [34], or on the Cep- 
strum transformation [11] which is the basis for Mel Fre- 
quency Cepstral Coefficients (MFCC) [23, 15, 9, 24, 30]. 


1.3 Paper outline 


Section 2 presents a high-level description of our new 
attack, with full technical details given in Section 3. Sec- 
tion 4 presents experimental results. Section 5 describes 
the attack we conducted in-field. We conclude with some 
final remarks in Section 6. 
2 Attack Overview 


In this section, we survey our attack without delving into 
the technical details. We consider the scenario that En- 
glish text containing potentially sensitive information is 
printed on a dot-matrix printer, and the emitted sound is 
recorded. We develop a methodology that, given the 
recording as input, automatically reproduces the printed text. 
Figure 4 presents a holistic overview of the attack. 

The first phase (Figure 4(a)) constitutes the training 
phase that can take place either before or after the attack. 
In this phase, a sequence of words from a dictionary is 
printed, and characteristic sound features of each word 
are extracted and stored in a database. For obtaining the 
best results, the setting should be close to the setting in 
which the actual attack is mounted, e.g., similar envi- 
ronmental noise and acoustics. Our experiments indicate 
that creating sufficiently good settings for reconstruction 
does not pose a problem, see Section 4.3.2. The main 
steps of the training phase are as follows: 


1. Feature extraction. We use a novel feature design 
that borrows from commonly used techniques for 
feature extraction in speech recognition and mu- 
sic processing. In contrast to these areas, our ex- 
periments show that most interesting features for 
printed sounds occur above 20 kHz, and that a log- 
arithmic scale cannot be assumed for them. We 
hence split the recording into single words based on 
the intensity of the frequency band between 20 kHz 
and 48 kHz, and spread the filter frequencies lin- 
early over the frequency range. We subsequently 
use digital filter banks to perform sub-band decom- 
position on each word [27]. As discussed in Sec- 
tion 3.1, sub-band decomposition gives better re- 
sults than simple FFT because of better time res- 
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olution. The output of sub-band decomposition is 
smoothed to make it more robust to measurement 
variations and environmental noise. The extracted 
features are stored in a database. 


2. Computation of language models. To solve the 
recognition task, we will complement acoustic in- 
formation with information about the occurrence 
likelihood of words in their linguistic context (e.g., 
the sequence “such as the” is much more likely than 
“such of the”). More specifically, we estimate for 
each word in our lexicon n-gram probabilities, i.e., 
the likelihood that the word occurs after a sequence 
of n - 1 given words. These probabilities make 
up a (statistical) language model. Probabilities are 
computed based on frequency counts of n-place se- 
quences (n-grams) from a corpus of text documents. 
We need to extract these frequencies from a suf- 
ficiently large corpus, which makes up the second 
step of the training phase. In our experiments, we 
used 3-gram frequencies extracted from a corpus of 
10 million words of English text. For our domain- 
specific experiments, we used a corpus of living- 
will declarations consisting of 14,000 words of En- 
glish text. 
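
The estimation of 3-gram probabilities from frequency counts, as described in the second training step, could look roughly like the following sketch. It uses plain relative frequencies without any smoothing, which is our simplification; the function name `train_trigram_model` and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter


def train_trigram_model(tokens):
    """Estimate P(w3 | w1, w2) as count(w1, w2, w3) / count(w1, w2, *)."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    context_counts = Counter()
    for (w1, w2, _), count in trigram_counts.items():
        context_counts[(w1, w2)] += count

    def probability(w1, w2, w3):
        context = context_counts[(w1, w2)]
        # Unseen contexts get probability 0.0 here; a real model would smooth.
        return trigram_counts[(w1, w2, w3)] / context if context else 0.0

    return probability


corpus = "such as the printer such as the needle such of the".split()
trigram_probability = train_trigram_model(corpus)
likely = trigram_probability("such", "as", "the")
unlikely = trigram_probability("such", "of", "a")
```

In practice the counts would come from a much larger corpus, such as the 10 million words mentioned above, and unseen sequences would need some form of smoothing or back-off.
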


The second phase (Figure 4(b)), called the recognition 
phase, uses the characteristic features of the trained 
words to recognize new sound recordings of printed text, 
complemented by suitable language-correction tech- 
niques. The main steps are as follows: 


1. Select candidate words. We start by extracting fea- 
tures of the recording of the printed target text, as in 
the first step of the training phase. Let us call the ob- 
tained sequence of features target features whereas 
the features from the training phase stored in the 
database are henceforth referred to as trained fea- 
tures. Now, we subsequently compare, on a word- 
by-word basis, the obtained target features with 
the trained features of the dictionary stored in the 
database. 


If the features extracted from different recordings of 
the same word were always identical, one would ob- 
tain a unique correspondence between trained fea- 
tures and target features (under the assumption that 
all text words are in the dictionary). However, mea- 
surement variations, environmental noise, etc. show 
that this is not the case. Multiple recordings of the 
same word sometimes yield different features; for 
example, printing the same word at different places 
in the document results in differing acoustic em- 
anations (Figure 10 illustrates how a single verti- 
cal line already differs in the intensity); conversely, 
recordings of words that differ significantly in their 
spelling might yield almost identical sound features. 
We hence let the selected, trained word be a random 
variable conditioned on the printed word, i.e., every 
trained word will be a candidate with a certain prob- 
ability. Using sufficiently good feature extraction 
and distance computations between two features, 
the probabilities of one or a few such trained words 
will dominate for each printed word. The output 
of the first recognition step is a list of most likely 
candidates, given the acoustic features of the target 
word. 


2. Language-based reordering to reduce word error 
rate. We finally try to find the most likely se- 
quence of printed words given a ranked list of candi- 
date words for each printed word. Although always 
naively picking the most likely word based on the 
acoustic signal might already yield a suitable recog- 
nition quality, we employ Hidden Markov Model 
(HMM) technology, in particular language models 
and the Viterbi algorithm (see Section 3.3.3 for de- 
tails), which is regularly used in speech recognition, 
to determine the most likely sequence of printed 
words. Intuitively, this technology works well for us 
because most errors that we encounter in the recog- 
nition phase are due to incorrectly recognized words 
that do not fit the context; by making use of linguis- 
tic knowledge about likely and unlikely sequences 
of words, we have a good chance of detecting and 
correcting such errors. The use of HMM technology 
yields accuracy rates of 70 % on average for words 
for the general-purpose corpus, and up to 95 % for 
the domain-specific corpus, see Section 3.3 for de- 
tails. 


We modified the Viterbi algorithm to meet our spe- 
cific needs: first, the standard algorithm accepts as 
input a sequence of outputs, while we get for each 
position an ordered list of likely candidates, and we 
want to profit from this extra knowledge; second, 
we need to decrease memory usage, since a standard 
implementation would consume more than 30 GB 
of memory. 


3 Technical Details 


In this section we provide technical details about our at- 
tack, including the background in audio processing and 
Hidden-Markov Models. 


3.1 Feature extraction 


We are faced with an audio file sampled at 96 kHz with 
16-bit resolution. 

To split the recording into words, we use a threshold 
on the intensity of the frequency band from 20 kHz to 
48 kHz. For printers, our experiments have shown that 
most interesting features occur above 20 kHz, making 
this frequency range a reliable indicator despite its sim- 
plicity; ignoring the lower frequencies moreover avoids 
most noise added by the movement of the print-head etc. 
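
The threshold-based splitting can be illustrated with a minimal sketch. It assumes that the intensity of the 20 kHz to 48 kHz band is already available as one value per time frame; the function name `split_words`, the `min_gap` parameter and its default are hypothetical and are not taken from our MATLAB implementation.

```python
def split_words(intensity, threshold, min_gap=3):
    """Return (start, end) frame ranges whose band intensity exceeds threshold.

    `end` is exclusive. Dips shorter than `min_gap` quiet frames are treated
    as part of the same word (an assumption made for this sketch).
    """
    segments = []
    start = None   # first frame of the segment currently being built
    quiet = 0      # consecutive quiet frames seen since the last loud frame
    for i, value in enumerate(intensity):
        if value >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap:
                # The word ended at the last loud frame, i - quiet.
                segments.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        segments.append((start, len(intensity) - quiet))
    return segments
```

In such a sketch the threshold plays the role of the manually adapted value mentioned in Section 4.1; a larger `min_gap` merges more aggressively and a smaller one splits more readily.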

From the split signal, we compute the raw spectrum 
features by sub-band decomposition, a common tech- 
nique in different areas of audio processing. The signal is 
filtered by a filter bank, a parallel arrangement of several 
bandpass filters tuned in steps of 1 kHz over the range 
from 1 kHz to 48 kHz. 

For noise reduction, the output of the filters is 
smoothed and normalized, the amount of data is reduced (the 
maximal value out of 5 is kept), and the result is smoothed 
again. The result is appropriately discretized over time and 
forms a set of vectors, one vector for each filter. 
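
The post-processing of a single filter output can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the smoothing or normalization method, so the moving average, the rectification by absolute value, the peak normalization and all function names are choices made for this example.

```python
def moving_average(values, width=3):
    """Centered moving average; windows are truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def normalize(values):
    """Scale so that the largest value is 1 (assumed normalization)."""
    peak = max(values, default=0.0)
    return [v / peak for v in values] if peak > 0 else list(values)


def max_pool(values, block=5):
    """Keep the maximal value out of every `block` consecutive values."""
    return [max(values[i:i + block]) for i in range(0, len(values), block)]


def band_feature(filter_output):
    """Smooth, normalize, reduce (max of 5), and smooth again."""
    smoothed = moving_average([abs(v) for v in filter_output])
    reduced = max_pool(normalize(smoothed), block=5)
    return moving_average(reduced)
```

Applying `band_feature` to the output of each bandpass filter would yield one vector per filter, which together form the feature of a word.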

The feature design has a major influence on the run- 
ning time and storage requirements of the subsequent 
audio processing. We have experimented with several 
alternative feature designs, but obtained the best results 
with the design described above. The (Short-time) Fast 
Fourier Transform (SFFT) [34] seems a natural alternative 
to sub-band decomposition. There is, however, a 
trade-off between the frequency and the time resolution, 
and we obtained worse results in our setting when we used 
SFFTs, similar to earlier observations [42]. 


3.2 Select candidate words 


Deciding which database entry is the best match for a 
recording is based on the following distance function 
defined on features; the tool outputs the 30 most similar 
entries along with the calculated distance. Given the 
features extracted from the recording $(x_1, \ldots, x_t)$ and the 
features of a single database entry $(y_1, \ldots, y_t)$, we 
compute the angle between each pair of vectors $x_i$, $y_i$ and 
sum over all frequency bands: 

$$\Delta\bigl((x_1, \ldots, x_t), (y_1, \ldots, y_t)\bigr) = \sum_{i=1}^{t} \arccos\left(\frac{x_i \cdot y_i}{\lVert x_i \rVert \, \lVert y_i \rVert}\right)$$


To increase robustness and decrease computational 
complexity in practical scenarios, some problems need to be 
addressed. First, our implementation of cutting the 
audio file is sometimes slightly off, which leads to slightly 
misaligned samples. We therefore consider minor shifts of 
each sample by small amounts (two steps in each direction, 
for a total of 5 measurements) and take the minimum 
angle (i.e., the maximum similarity). Second, for a similar 
reason, we tolerate some deviation in the length of the 
features. We penalize too large deviations by multiplying 
the distance by a factor of 1.2 if the lengths of the query 
and the database entry differ by more than a defined threshold. 
The factor and the threshold are derived from our 
experiments. Third, we discard entries whose length deviates 
from the target feature by more than 15 % in order to 
speed up the computation. 
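
The distance computation with its three refinements can be sketched as follows. The sketch is not our MATLAB code: the value of `length_tol` is a placeholder for the unspecified threshold, applying one common shift to all bands, truncating both features to a common length, and zero-padding shifted vectors are assumptions, and all names are illustrative.

```python
import math


def angle(u, v):
    """Angle in radians between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return math.pi / 2  # assumption: an all-zero vector counts as orthogonal
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def shifted(vec, k):
    """Shift a vector by k steps (positive = later), padding with zeros."""
    n = len(vec)
    if k >= 0:
        return ([0.0] * k + list(vec))[:n]
    return (list(vec)[-k:] + [0.0] * (-k))[:n]


def distance(query, entry, max_shift=2, length_tol=0.05, penalty=1.2, discard=0.15):
    """Summed per-band angle; query and entry hold one vector per filter.

    Returns None if the entry is discarded because of its length.
    """
    len_q, len_e = len(query[0]), len(entry[0])
    deviation = abs(len_q - len_e) / len_q
    if deviation > discard:           # third refinement: skip entry
        return None
    n = min(len_q, len_e)
    best = min(                       # first refinement: 5 shifts, keep minimum
        sum(angle(x[:n], shifted(y[:n], k)) for x, y in zip(query, entry))
        for k in range(-max_shift, max_shift + 1)
    )
    # second refinement: penalize a noticeable length mismatch
    return best * penalty if deviation > length_tol else best


def best_matches(query, database, top=30):
    """Return the `top` closest (distance, word) pairs from a word->feature dict."""
    scored = []
    for word, entry in database.items():
        d = distance(query, entry)
        if d is not None:
            scored.append((d, word))
    return sorted(scored)[:top]
```

The list returned by `best_matches` corresponds to the ranked candidate list that is handed to the language-model stage.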

Using the angle to compare features is a common 
technique. Other approaches that are used in different 
scenarios include the following: Müller et al. present an 
audio matching method for chroma based features that 
handles tempo differences [28]. Logan and Salomon use 
signatures based on clustered MFCCs as input for the 
distance calculation in [24]. Furthermore, they use the 
earth mover’s distance [32] for the signatures (minimum 
amount of work to transform one signature into another) 
and the Kullback-Leibler (KL) distance for the clusters 
inside the signature as distance measures. 


3.3 Post-processing using HMM technology 


In this section we describe techniques based on language 
models to further improve the quality of reconstruction. 
These improve the word recognition rate from 63 % 
to 70 % on average, and up to 72 % in some cases. 
The domain-specific HMM-based post-processing even 
achieves recognition rates of up to 95 %. 


3.3.1 Introduction to HMMs 


Hidden Markov models (HMMs) are graphical models 
for recovering a sequence of random variables which 
cannot be observed directly from a sequence of (ob- 
served) output variables. The random variables are mod- 
eled as hidden states, the output variables as observed 
states. HMMs have been employed for many tasks that 
deal with natural language processing such as speech 
recognition [31, 18, 17], handwriting recognition [29] or 
part-of-speech tagging [12, 14]. 

Formally, an HMM of order $d$ is defined by a five-tuple 
$(Q, O, A, B, I)$, where $Q = (q_1, q_2, \ldots, q_N)$ is the set of 
(hidden) states, $O = (o_1, o_2, \ldots, o_T)$ is the set of 
observations, $A = Q^{d+1}$ is the matrix of state transition 
probabilities (i.e., the probability to reach state $q_{d+1}$ when 
being in state $q_d$ with history $q_1, \ldots, q_{d-1}$), $B = Q \times O$ 
are the emission probabilities (i.e., the probability of 
observing a specific output $o_i$ when being in state $q_j$), and 
$I = Q^d$ is the set of initial probabilities (i.e., the 
probability of starting in state $q_i$). Figure 5 shows a 
graphical representation of an HMM, where unshaded circles 
represent hidden states and shaded circles represent 
observed states. 

Figure 5: Hidden Markov Model 


In our setting the words that were printed are unknown 
and correspond to the hidden states. The observed states 
are the output of the first stage of reconstruction from 
the acoustic signals emitted by the printer. What makes 
HMMs particularly attractive for our task is that they al- 
low us to combine two sources of information: first, the 
acoustic information present in the observed signal, and 
second, knowledge about likely and unlikely word com- 
binations in a well-formed text. Both sources of infor- 
mation are important for recovering the original text. 

To utilize HMMs for our task, we need to solve two 
problems: we need to estimate the model parameters of 
the HMM (training phase), and we need to determine the 
most likely sequence of hidden states for a sequence of 
observations given the model (recognition phase). The 
method described in Section 3.2 approximates the es- 
timation of the emission probabilities by computing a 
ranking of the candidate words given an observed acous- 
tic signal. The initial probabilities, which model the 
probability of starting in a given state, and the transi- 
tion probabilities, which model the likelihood of differ- 
ent words following each other in an English text, can 
be obtained by building a language model from a large 
text corpus. To address the second problem, 
determining the most likely sequence of hidden states (i.e., the 
most likely sequence of printed words in the target text), 
we can use the Viterbi algorithm [37]. In the following 
two sections, we describe in more detail how we com- 
pute the language models and how the candidate words 
are reordered by applying the Viterbi algorithm. 


3.3.2 Building the language models 


A language model of size n assigns a probability to each 
sequence of n words. The probability distribution can be 
estimated by computing the frequencies of all n-grams 
from a large text corpus. Note that language models are 
to some extent domain and genre dependent, i.e., a 
language model built from a corpus of financial texts will 
not be a very good model for predicting likely word 
sequences in biomedical texts. To cover a large range of 
domains and thus make our model robust in the face of 
arbitrary input texts, we train the language model on a 
diverse selection of stable Wikipedia articles. The cor- 
pus has a size of 63 MB and contains approximately 10 
million words. For our domain-specific experiments, we 
used a corpus of living-will declarations consisting of 
14,000 words of English text. From the corpus, we 
extracted all 3-grams and computed their frequencies.¹ We 
took into consideration all 3-grams that appeared at least 
3 times. As n-grams with probability 0 will never be 
selected by the Viterbi algorithm, we smooth the proba- 
bilities by assigning a small probability to each unseen 
n-gram. 
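
The construction of such a 3-gram model can be sketched as follows. This is a minimal illustration, not our implementation: the helper names, the floor value `epsilon` and the simple floor-style smoothing (which does not renormalize the distribution) are assumptions; only the lower-casing, the punctuation stripping and the minimum count of 3 are taken from the description above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and strip punctuation characters."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def trigram_counts(tokens, min_count=3):
    """Count all 3-grams and keep those that appear at least min_count times."""
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return {gram: c for gram, c in counts.items() if c >= min_count}


def make_model(counts, epsilon=1e-6):
    """Return a function estimating P(w3 | w1, w2).

    Unseen 3-grams receive the small probability `epsilon`, so that the
    Viterbi algorithm can still select them.
    """
    context_totals = Counter()
    for (w1, w2, _), c in counts.items():
        context_totals[(w1, w2)] += c

    def prob(w1, w2, w3):
        c = counts.get((w1, w2, w3), 0)
        return c / context_totals[(w1, w2)] if c else epsilon

    return prob
```

A more careful model would redistribute probability mass (for example with a discounting scheme) instead of using a fixed floor; the floor is used here only because it is the simplest instance of assigning a small probability to each unseen n-gram.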

The length of an n-gram determines how many words 
of context (i.e., how many previous hidden states in the 
HMM) are taken into account by the language model. 
Higher values for n can lead to better models but also 
require exponentially larger corpora for an accurate esti- 
mation of the n-gram probabilities. The higher the value 
of n, the larger the likelihood that some n-grams never 
appear in the corpus, even though they are valid word 
sequences and thus may still appear in the printed text. 


¹ All 3-grams were converted to lower case and punctuation 
characters were stripped off. 


3.3.3 Reordering of candidate words based on language models 


Having built the language model, we can reorder the 
candidate words using the model to select the most 
likely word sequence (i.e., the most likely sequence 
of hidden states). This task is addressed by the 
Viterbi algorithm [37], which takes as input an HMM 
$(Q, O, A, B, I)$ of order $d$ and a sequence of 
observations $a_1, \ldots, a_T \in O^T$. Its state consists of $\Psi = T \times Q^d$. 
First, the $d$-th step is initialized (the earlier are unused) 
according to the initial distribution, weighted with the 
observations: 

$$\Psi_{d, i_1, \ldots, i_d} = I_{i_1, \ldots, i_d} \cdot \prod_{k=1,\ldots,d} B_{i_k, a_k}$$

In the recursion, for increasing indices $s$, the maximum 
of all previous values is taken: 

$$\Psi_{s, i_1, \ldots, i_d} = B_{i_d, a_s} \cdot \max_{i_0 \in Q} \left( A_{i_0, i_1, \ldots, i_d} \cdot \Psi_{s-1, i_0, \ldots, i_{d-1}} \right) \qquad \forall s > d,\ 1 \le i_j \le N.$$


Finally, the sequence of hidden states can be obtained 
by backtracking the indices that contributed to the maxi- 
mum in the recursion step. 

The memory required to store the state $\Psi$ is $O(T \cdot N^d)$, 
and the running time is $O(T \cdot N^{d+1})$, as we are 
optimizing over all $N$ hidden states for each cell, so memory 
requirements are a major challenge in implementing 
the Viterbi algorithm. For example, using a dictionary 
of 1,000 words, the memory requirements of our 
implementation for 3-grams are slightly above 2 GB and 
grow quadratically in $N$. 

We use two techniques to overcome these problems: 


1. First, instead of storing the complete transition ma- 
trix A we compute the values on-the-fly (keeping 
only the list of 3-grams in memory). 


2. Second, we do not optimize over all possible words, 
but only over the $M = 30$ best rated words from 
the previous stage. This brings down memory 
requirements to $O(T \cdot M^d)$ and execution time to 
$O(T \cdot M^{d+1})$. The size of $\Psi$ in this case is 130 MB 
for 3-grams. 


Further improvements are conceivable, e.g., by using 
parallel scalability [40]. 
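
The two techniques can be combined in a compact sketch of the Viterbi recursion for 3-grams ($d = 2$). It is an illustration rather than our C implementation: the interface (per-position lists of `(word, emission_probability)` pairs and a callable `trigram_prob`), the uniform initial distribution and the use of log-probabilities are assumptions made for this example.

```python
import math


def viterbi_trigram(candidates, trigram_prob):
    """Most likely word sequence given per-position candidate lists.

    candidates: one list per printed word of (word, emission_probability)
                pairs, e.g. the M best matches; probabilities must be > 0.
    trigram_prob(w1, w2, w3): P(w3 | w1, w2) > 0, evaluated on the fly
                instead of storing the full transition matrix.
    """
    T = len(candidates)
    if T == 0:
        return []
    if T == 1:
        return [max(candidates[0], key=lambda c: c[1])[0]]
    # Initialization over pairs of candidates for the first two positions
    # (uniform initial distribution assumed, so only emissions contribute).
    score = {
        (w0, w1): math.log(e0) + math.log(e1)
        for w0, e0 in candidates[0]
        for w1, e1 in candidates[1]
    }
    back = []  # per step: best predecessor word for each state (w_prev, w_cur)
    for t in range(2, T):
        new_score, pointers = {}, {}
        for (w1, w2), s in score.items():
            for w3, e3 in candidates[t]:
                value = s + math.log(trigram_prob(w1, w2, w3)) + math.log(e3)
                key = (w2, w3)
                if key not in new_score or value > new_score[key]:
                    new_score[key] = value
                    pointers[key] = w1
        back.append(pointers)
        score = new_score
    # Backtracking from the best final state.
    last = max(score, key=score.get)
    words = [last[0], last[1]]
    for pointers in reversed(back):
        words.insert(0, pointers[(words[0], words[1])])
    return words
```

Each step keeps at most $M^2$ scores and back-pointers and examines $M^3$ combinations, which matches the $O(T \cdot M^d)$ memory and $O(T \cdot M^{d+1})$ time bounds stated above for $d = 2$.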


4 Experiments and Statistical Evaluation 


In this section we describe our experiments for evaluat- 
ing the attack. In addition to describing the set-up and the 
experimental results on the recognition rate for sample 
articles, we present our experiments for evaluating the 
influence of using different microphones, printers, fonts, 
etc. on the recognition rate; moreover, we identify and 
evaluate countermeasures. 


4.1 Setup 


We use an Epson LQ-300+II (24 needles) without printer 
cover and the in-built mono-spaced font for printing 
texts. The sound is recorded from a short distance us- 
ing a Sennheiser MKH-8040 microphone with nominal 
frequency range from 30 Hz to 50 kHz. Unless stated 
otherwise, the experiments were conducted in a 
normal office with the door closed and no people talking 
inside the room. There was no special shielding against 
noise from the outside (e.g., traffic noise). In the training 
phase we used a dictionary containing 1,400 words; the 
dictionary consists of a list of the 1,000 most frequent 
words from our corpus augmented with the words that 
appeared in our example texts.² Inflected forms, 
capitalization, as well as words with leading punctuation marks 
need to be counted as different words, as their sound fea- 
tures might significantly differ (blurring propagates from 
left to right within a word). 

We work with the sound recordings of four different 
articles from Wikipedia on different topics: two articles 
on computer science (on source-code and printers), one 
article on politics (on Barack Obama), and one article 
on art (on architecture) with a total of 1,181 words to 
evaluate the attack. 

The training and matching phases have been implemented 
in MATLAB using the Signal Processing Toolbox, 
a MATLAB extension for conveniently processing 
audio signals. The HMM-based post-processing 
is implemented in C. The tool is fully automated, 
with the only exceptions being threshold values 
that need manual adaptation for a given attack scenario. In 
the scenario with the microphone placed 10 cm in front 
of the printer, obtaining the threshold values is 
straightforward, as they can be determined directly from the 
intensity plots. In case of a more blurred signal (e.g., 
due to a larger distance), we iteratively determined 
suitable values, essentially by trial and error. The training 
phase takes a one-time effort of several hours for 
building up the sound feature database for the words in the 
dictionary. The recognition phase takes approximately 
2 hours for matching one page of text, including full 
HMM-based post-processing. Memory usage of the 
procedure is substantial, because the feature database and 
the HMM-related information are kept in main memory 
to speed up computation. Trade-offs with less memory 
consumption but longer execution times can easily be 
realized. 


² In a real attack, ensuring that (almost) all words of the text 
occur in the dictionary can be achieved using several techniques: using 
contextual knowledge to reduce the number of words that are likely to 
appear in the text, training a larger dictionary, or using feedback-based 
learning to subsequently add missing words to the dictionary. 


4.2 Results 


The recognition rates for the four articles in our 
experiments are depicted in Figure 6. The first row shows 
the recognition rates if no HMM-based post-processing 
is used, i.e., these numbers correspond to the output of 
the matching phase. For illustration, we wrote in 
brackets the rate at which the correct word was within the three 
highest-ranked words in the matching phase. The 
second row depicts the recognition rates after applying 
post-processing with HMMs based on 3-grams. We thus 
achieve recognition rates between 67 % and 72 % for 
the four articles. 


                      Text 1            Text 2            Text 3            Text 4            Overall
Basic Top 1 (Top 3)   60.5 % (75.1 %)   66.5 % (79.2 %)   62.8 % (78.7 %)   61.5 % (77.9 %)   62.9 % (78.0 %)
HMM 3-gram            66.7 %            71.8 %            [illegible]       69.0 %            69.9 %

Figure 6: Recognition rates of our four sample articles. The first row shows the recognition rates if no HMM-based 
post-processing is used; the second row depicts the recognition rates after applying post-processing with HMMs based 
on 3-grams using a general-purpose corpus. 


                                             Declaration 1     Declaration 2
Basic Top 1 (Top 3)                          59.5 % (77.8 %)   57.5 % (72.6 %)
HMM 3-gram (using general-purpose corpus)    68.3 %            60.8 %
HMM 3-gram (using domain-specific corpus)    95.2 %            72.5 %

Figure 7: Recognition rates of our two additional documents using domain-specific HMM-based post-processing. 
The first row shows the recognition rates without HMM-based post-processing; the second and third rows depict the 
recognition rates after applying post-processing with HMMs based on 3-grams using a general-purpose corpus and a 
domain-specific corpus, respectively. 


While the aforementioned results employ HMM- 
based post-processing using a general-purpose corpus, 
our experiments indicate that domain-specific corpora 
yield even better results. Recall that we considered two 
additional documents containing living-will declarations 
that we intended to analyze using a domain-specific 
corpus. The recognition rates for the two living-will 
declarations are depicted in Figure 7. The first and second rows 
again depict the results without and with general-purpose 
HMM-based post-processing, respectively; the third row shows 
the results for HMM-based post-processing using the 
domain-specific corpus. We achieve recognition rates of 95.2 % 
and 72.5 % for the two documents, respectively. Text 
examples for the reconstruction using a general-purpose 
corpus and a domain-specific corpus are provided in Ap- 
pendix A and Appendix B, respectively. 
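
The Top 1 and Top 3 rates reported in Figures 6 and 7 are simple hit rates over ranked candidate lists. A minimal sketch, with the hypothetical function name `top_k_rate` and made-up example data:

```python
def top_k_rate(ranked_candidates, printed_words, k=1):
    """Fraction of printed words found among the k highest-ranked candidates.

    ranked_candidates: one list of candidate words per printed word,
                       ordered from most to least similar.
    """
    if not printed_words:
        raise ValueError("no words to evaluate")
    hits = sum(
        1
        for ranked, word in zip(ranked_candidates, printed_words)
        if word in ranked[:k]
    )
    return hits / len(printed_words)
```

With `k=1` this corresponds to the rate outside the brackets, and with `k=3` to the rate inside the brackets.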


We also experimented with 4-gram and 5-gram 
language models. In addition to encountering even more 
severe problems of memory consumption, our 
experiments indicated that the recognition rates do not improve 
over 3-grams. While this behavior might be surprising at 
first glance, it can be explained by the sparseness of the 
training data: the number of 5-grams that we can extract 
from our corpus is approximately $10^7$, but the transition matrix 
of an HMM based on 5-grams on a dictionary of 1,000 
words has $10^{15}$ entries; thus the number of 5-grams is 
far too small compared to the number of entries. For similar 
reasons 4-grams and 5-grams are rarely used in natural 
language processing. 
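
The sparseness argument can be made explicit with a short calculation, using the approximately $10^7$ 5-grams extracted from the corpus and the dictionary of $10^3$ words mentioned above:

$$\frac{\text{observed 5-grams}}{\text{entries of the transition matrix}} \approx \frac{10^{7}}{(10^{3})^{5}} = \frac{10^{7}}{10^{15}} = 10^{-8}$$

At most one entry in a hundred million can thus be estimated from observed data. For comparison, the transition matrix for 3-grams over the same dictionary has only $(10^{3})^{3} = 10^{9}$ entries.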
4.3 Discussion and Supplemental Experiments 


We have evaluated the influence on the recognition rate 
of using different microphones, different printers, 
proportional fonts, etc., and we investigated why the 
reconstruction works from a conceptual perspective. In a 
nutshell, the results can be summarized as follows (details 
are given below): Several parameters of modified set-ups 
did not affect the recognition rate and gave comparable 
results, e.g., using cheaper microphones or using differ- 
ent printers (of the same model) for the training phase 
and the recognition phase. Using proportional instead 
of mono-spaced fonts or using different printer models 
only slightly decreased the recognition rate. Some con- 
siderably stronger modifications, however, did not work 
out at all, and they can be seen as conceptual limitations 
of our attack. This comprises using completely differ- 
ent printer technologies such as ink-jet or laser printers 
(because of the absence of suitable sound emissions that 
can be used to mount the attack). We provide statistical 
results on these modifications below. Furthermore, we 
evaluate countermeasures. 


4.3.1 Using different microphones 


Our experiments have indicated that information that is 
relevant for us is carried in the frequency range above 
approximately 20 kHz, see Section 3. Microphones with 
nominal frequency range higher than 20 kHz are rather 
expensive, e.g., the Sennheiser microphone referred to 
in Section 4.1 has a frequency range up to 50 kHz and 
costs approximately 1,300 dollars. However, our 
experiments have shown that some microphones with a 
nominal frequency range of 20 kHz are sensitive to higher 
frequencies as well (possibly with less accurate frequency 
response, but this had no noticeable influence on the 
recognition rate as long as we use the same microphone 
for recording both the training data and the attack data). 
Figure 8 shows in the second row the recognition rates 
of one sample article if a Behringer B-5 microphone is 
used, which has a nominal frequency range up to 20 kHz 
and costs approximately 80 dollars. The results obtained 
with the Behringer microphone are only slightly worse 
than the results using the Sennheiser microphone. 


                                                              Top 1 (Top 3)
Sennheiser MKH-8040 microphone and Epson LQ-300+II printer    62 % (78 %)
Behringer B-5 microphone                                      59 % (85 %)
Sennheiser ME 2 clip-on microphone                            57 % (72 %)
OKI Microline 1190 printer                                    41 % (51 %)
Another Epson LQ-300+II                                       54 % (72 %)
Proportional font                                             57 % (71 %)

Figure 8: Results of the reconstruction with different microphone models and different printer models. (These control 
experiments were conducted on shorter texts and corpora than the previous experiments and no HMM-based 
post-processing was applied.) 

We also conducted an experiment using a small clip- 
on microphone — a Sennheiser ME 2 with nominal fre- 
quency range up to 18 kHz, which costs approximately 
130 dollars. The recognition rates of one sample ar- 
ticle are shown in the third row of Figure 8; they are 
again only slightly worse than the rates with the larger 
Sennheiser microphone. 


4.3.2 Using different dot-matrix printers 


We also evaluated if the printer model influences the 
recognition rate. The fourth row of Figure 8 shows the 
recognition rates of one article printed with an OKI Mi- 
croline 1190 printer. The recognition rate is not as good 
as for the Epson printer, but it is still good. 

So far we always considered the set-up that training 
data and the attacked text are printed on the same printer. 
In a realistic attack scenario, however, it is unlikely 
that the attacker can print the training data on the same 
printer, but instead arranges access to another printer of 
the same printer model that he places in an acoustically 
similar environment. Our in-field attack described in de- 
tail in Section 5 is of this kind. 

We demonstrate that the recognition rate only 
decreases slightly when using a different printer in the 
training phase. For this experiment we used the feature 
database that we previously recorded in the experiment 
described in Section 4.2, and printed one article on 
another Epson LQ-300+II printer that we bought from a 
different vendor. The recognition rate is shown in 
Figure 8, indicating a decrease of recognition rate of about 
8 % compared with the results from Section 4.2. 

Figure 9: Ink-jet printer, disassembled for analysis. 

This shows that it is practical to train a large dictionary 
offline. In the in-field attack described in Section 5 we 
use this result and train a dictionary on a separate printer. 


4.3.3 Using proportional fonts 


Monospaced fonts are commonly used in many appli- 
cations of dot-matrix printers; in particular, the in-built 
fonts are monospaced, and most applications seem to use 
these in-built fonts. Using proportional fonts instead in- 
tuitively relies on a more compact depiction of words that 
amplifies the effect of blurring. However, our experi- 
ments demonstrate that the recognition still works well, 
at a slightly lower rate (see Figure 8). 


4.3.4 On attacking other printer technologies 


While dot-matrix printers are still deployed in some 
security-critical applications (see Figure 1), they have 
been replaced by other printer technologies such as 
ink-jet printers (see Figure 9) and laser printers in other 
applications. Ink-jet printers might be susceptible to 
similar attacks, as they construct the printout from 
individual dots, as dot-matrix printers do. On the one hand, 
the bubbles of ink might produce shock-waves in the air 
that potentially can be captured by a microphone; on the 
other hand, the piezo-electric elements used in some 
ink-jet printers might produce noise that can be measured. 
However, we were not able to capture these emanations. 
One reason might be that these faint sounds, if they 
exist, are dominated by the noise emitted by the 
mechanical parts of a printer. For laser printers, one expects that 
no information about the printed text is leaked, and our 
experiments support this view. Thus, to the best of our 
knowledge, these printer technologies seem to be 
unaffected by this kind of attack. 

Figure 10: Each graph shows the intensity measured when printing a single vertical line, demonstrating the variations 
that can occur. 


4.4 Countermeasures 


The (obvious) idea that underlies all countermeasures is 
to suppress the acoustic emanations to such an extent that 
reconstruction becomes hard in practical scenarios. 


Acoustic shielding foam: The specific printer model that 
we used in most experiments has an optional printer 
cover with embedded acoustic shielding foam. Closing 
this cover absorbs a substantial amount of the acoustic 
emanation (see Figure 11). To further evaluate this idea, 
we built a box out of ordinary acoustic foam and placed 
the printer inside (shown in Figure 12). In contrast to the 
results with the cover, the recognition rate for the foam 
box was surprisingly good; 51 % of the words were 
reconstructed successfully. We believe that the shielding 
characteristics of the two types of foam suppress 
different ranges of the acoustic spectrum and thus have 
different effects on the reconstruction rate. 


Closed door    0 % (0 %) 

Figure 11: More results of the reconstruction evaluating 
the effectiveness of different countermeasures. (These 
control experiments were conducted on a shorter text 
than the previous experiments, and no HMM-based 
post-processing was applied.) 


Figure 12: Printer in foam box for shielding evaluation. 


Distance: Our experiments indicate that the recogni- 
tion rate drops substantially if the distance between the 
printer and the microphone is increased. From a distance 
of 2 meters, the recognition rate drops to approximately 
4 % (see Figure 11). From this distance our algorithm for 
splitting the signal into words requires manual 
intervention, as the audio signal contains more noise. However, 
we stress that this limitation can be circumvented in an 
in-field attack by placing a miniaturized wireless bug in 
close proximity to (or even in) the printer. 


Closed door: We also tested the reconstruction from out- 
side the printer’s room with the door closed; the over- 
all distance between the printer and the microphone was 
4 meters. As expected, we found that in this setup no 
reconstruction was possible at all. 


Our results indicate that ensuring the absence of mi- 
crophones in the printer’s room is sufficient to protect 
privacy. Unfortunately, this evaluation is not guaran- 
teed to be complete; we merely state that our attack does 
not work under these circumstances. However, we be- 
lieve that the potential for improvement is limited; thus 
the above discussion still provides reasonable estimates. 
As future work, we furthermore plan to investigate addi- 
tional countermeasures such as introducing randomness 
into the printer’s sound through software changes, e.g., 
by letting the printer print individual letters in a (some- 
what) randomized order instead of always proceeding 
left-to-right. 


5 In-field Attack 


We have successfully mounted the attack in-field in a 
doctor’s practice to recover the content of medical 
prescriptions (the setup of the attack is shown in Figure 13). 
For privacy reasons, we asked for permission upfront and 
let the secretary print fresh prescriptions of an artificial 
client. The attack was conducted under realistic (and 
arguably even pessimistic) circumstances: during rush 
hour, with many people chatting in the waiting room. 

Figure 13: The setup of the in-field attack. 

We recorded the emitted sounds of printing seven 
different prescriptions. We handed over all sound 
recordings, the printouts of six prescriptions, and a printer of 
the same type (an Epson LQ-570) that we bought on eBay 
to one of the authors of this paper. The printouts were 
only used to determine which parts of the sound 
recording correspond to which parts of the prescription. The 
attack was carried out blindly, i.e., this author obtained 
no information about the seventh prescription except for 
its recorded sound. 

The author carrying out the attack took the following 
steps: 


1. From the available printouts, he first identified the 
position of the prescribed medication, the direction 
of printing, and the used font. 


2. Using a suitable threshold, he subsequently deter- 
mined the correct length and the white-space posi- 
tions. 


3. From a publicly available medication directory with 
about 14,000 different medications, he then de- 
termined possible candidates that matched these 
lengths. Here, abbreviations of words were also 
taken into account. The list of remaining candidates 
consisted of 29 entries. 


4. The selection of candidate words (without HMM- 
based post-processing) then already revealed the 
correct medication out of the remaining 29 candi- 
dates. 


The correct medication was “Müller’sche Tabletten 
bei Halsschmerzen”, a medication against sore 
throat. The printing was even abbreviated on the 
prescription as 


Muller’sche Tabletten bei 
Halsschm. 

The attack was actually easier to conduct in this 
practical scenario compared to the experiments in Section 4, 
because we were able to substantially narrow down the 
list of candidates by taking into account length informa- 
tion of the medication. Admittedly, the secretary herself 
unintentionally simplified this task by selecting a long 
medication name consisting of several words. 
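
The length-based narrowing of the medication directory can be sketched as follows. This is an illustration only: the abbreviation rule (a printed word may be shorter than the full word, but at least two characters long) is a hypothetical simplification of how abbreviations were taken into account, and all directory entries except the one named above are invented placeholders.

```python
def word_lengths(text):
    """Lengths of the whitespace-separated words of a printed line."""
    return [len(word) for word in text.split()]


def fits(name, observed, allow_abbreviation=True):
    """Check whether a directory entry is compatible with observed lengths."""
    full = word_lengths(name)
    if len(full) != len(observed):
        return False
    for full_len, seen_len in zip(full, observed):
        if seen_len == full_len:
            continue
        # Hypothetical rule: an abbreviated word is shorter than the full
        # word but still has at least two characters (letters plus ".").
        if allow_abbreviation and 2 <= seen_len < full_len:
            continue
        return False
    return True


def candidates(directory, observed):
    """Return all directory entries compatible with the observed lengths."""
    return [name for name in directory if fits(name, observed)]


# Example with an invented three-entry directory.
directory = [
    "Müller'sche Tabletten bei Halsschmerzen",
    "Examplin forte",
    "Samplol retard 500 mg",
]
observed = word_lengths("Muller'sche Tabletten bei Halsschm.")
remaining = candidates(directory, observed)
```

Only the candidates that survive this filter need to be compared acoustically, which is why a long, multi-word medication name makes the attack easier.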


6 Conclusion 


We have presented a novel attack that takes as input a 
sound recording of a dot-matrix printer processing En- 
glish text, and recovers up to 72 % of printed words. 
If we assume contextual knowledge about the text, the 
attack achieves recognition rates up to 95 %. After an 
upfront training phase, the attack is fully automated and 
uses a combination of machine learning, audio process- 
ing and speech recognition techniques, including spec- 
trum features, Hidden Markov Models and linear clas- 
sification; moreover, it allows for feedback-based incre- 
mental learning. We have identified and evaluated coun- 
termeasures that are suitable to prevent this kind of at- 
tack. We have successfully mounted the attack in-field in 
a doctor’s practice to recover the content of medical pre- 
scriptions under realistic conditions. Moreover, we have 
shown the relevance of this attack by commissioning a 
representative survey that showed that dot-matrix print- 
ers are still deployed in a variety of sensitive areas, in 
particular by banks and doctors. 
A Example Text Recognition with
General-purpose HMM Post-processing 


In the following we give an excerpt of the text on print- 
ers [38], see Section 4.2, to demonstrate the reconstruc- 
tion. 


A.1 The original text


First, we give the original text. 


In computing, a printer is a 
peripheral which produces a hard 
copy (permanent human-readable 
text and/or graphics) of documents 
stored in electronic form, usually 
on physical print media such as 
paper or transparencies. Many 
printers are primarily used 

as local peripherals, and are 
attached by a printer cable 

or, in most newer printers, a 

USB cable to a computer which 
serves as a document source. 

Some printers, commonly known 

as network printers, have built-in 
network interfaces (typically 


wireless or Ethernet), and can 
serve as a hardcopy device for any 
user on the network. Individual 


printers are often designed to 
support both local and network 
connected users at the same time. 


A.2 Output of the reconstruction without
HMM-based post-processing


Next, we give the reconstructed output without HMM- 
based post-processing. Recognition rate: 69 %. 


In computing, a printer in 5 
peripheral which produces 3 hard 
body (permanent human-readable 
text and/or graphics) of documents 
status in electronic form. 
usually 20 physical print media 
Such 30 pages or transparencies. 
Many Printers are primarily used 
go local peripherals, end are 
attached go A printer could 

or, in most newer printers; = 
USB cable go A computer which 
served de = document source.

name printers, commonly known 

go network printers; have built-in 
network interfaces (typically 


wireless As Ethernet), god way 
serve As = hardcopy device for out 
year we who network. Individual 


Printers use often designed 30 
support born local god network 
connected users go too name time. 


A.3 Output of the reconstruction with
general-purpose HMM-based post-processing


Finally, we give the reconstructed output after apply- 
ing the HMM-based post-processing using a general- 
purpose corpus. Recognition rate: 74 %. 


in computing a printer ina 
peripheral which produces a hard 
body permanent human-readable 

text and/or graphics of documents 
source in electronic form usually 
as physical print media such as 
pages or transparencies many 
printers are primarily used go 
local peripherals end are attached 
go a printer could or in most 
newer printers a usb cable go 

a computer which served de = 
document source some printers 
commonly known go network printers 
have built-in network interfaces 
typically wireless as ethernet god 
way serve as a hardcopy device for 
out year we who network individual 
printers use often designed so 
support born local god network 
connected users as too some tree 


B Example Text Recognition with Domain- 
specific HMM Post-processing 


In the following we illustrate the recognition of an ex- 
cerpt of a living-will declaration [13], see Section 4.2, to 
illustrate the domain-specific post-processing. 


B.1 The original text 


First, we give the original text. 


ADVANCE HEALTH CARE DIRECTIVE 
INSTRUCTIONS: This form lets you 
give specific instructions about 
any aspect of your health care. 
Choices are provided for you to 
express your wishes regarding 
the provision, withholding,
or withdrawal of treatment to
keep you alive, as well as the 
provision of pain relief. Space 
is provided for you to add to 
the choices you have made or for 
you to write out any additional 
wishes. This form also lets you 
express an intention to donate 
your bodily organs and tissues 
following your death. Lastly, 
this form lets you designate 

a physician to have primary 
responsibility for your health 
care. 


B.2 Output of the reconstruction with
general-purpose HMM-based post-processing


Next, we give the reconstructed output of the general-purpose
HMM-based post-processing. Recognition rate:
68 %.


advance health care directive 


instructions only form into you 
with consists observations peace 
who appear on your health care 
choices act provided for due to 
century many witness according 
one government declaration of 
witnesses be competent to been 
one alive as well as the provision 
of pain primary power to provided 
far one of out of now against 

the once made of way and we allow 
our own experience witness open 
form with lets can average as 
connected to donate year states 
canada and tissues including heat 
energy lastly this poor and you 
designing b according to food 
witness administration has been 
health care 


B.3 Output of the reconstruction with
domain-specific HMM-based post-processing


Finally, we give the reconstructed output after applying 
the HMM-based post-processing using a domain-specific 
corpus. Recognition rate: 95 %. 


advance health care directive 


instructions move form lets you 
give consists instructions about 
any aspect of your health care 

choices are provided for you to 
express your wishes regarding

the provision withholding or 
withdrawal of treatment to keep 
you alive as well as the provision 
of pain relief space is provided 
for you to add to the choices 

you have made or for you to david 
out any additional wishes move 
form also lets you express an 
intention to donate your bodily 
organs and tissues following your 
death lastly this form lets you 
designate a physician to have 
primary responsibility for your 
health care 
Abstract


Wireless networks are being integrated into the modern 
automobile. The security and privacy implications of 
such in-car networks, however, are not well understood,
as their transmissions propagate beyond the confines
of a car’s body. To understand the risks associated
with these wireless systems, this paper presents a privacy 
and security evaluation of wireless Tire Pressure Moni- 
toring Systems using both laboratory experiments with 
isolated tire pressure sensor modules and experiments 
with a complete vehicle system. We show that eavesdropping
is easily possible at a distance of roughly 40 m
from a passing vehicle. Further, reverse-engineering of
the underlying protocols revealed static 32-bit identifiers
and that messages can be easily triggered remotely,
which raises privacy concerns as vehicles can be tracked 
through these identifiers. Further, current protocols do 
not employ authentication and vehicle implementations 
do not perform basic input validation, thereby allowing 
for remote spoofing of sensor messages. We validated 
this experimentally by triggering tire pressure warning 
messages in a moving vehicle from a customized soft- 
ware radio attack platform located in a nearby vehicle. 
Finally, the paper concludes with a set of recommenda- 
tions for improving the privacy and security of tire pres- 
sure monitoring systems and other forthcoming in-car 
wireless sensor networks. 


1 Introduction 


The quest for increased safety and efficiency of automotive
transportation systems is leading car makers
to integrate wireless communication systems into automobiles.
While vehicle-to-vehicle and vehicle-to-
infrastructure systems [22] have received much attention, 
the first wireless network installed in every new vehicle 


*This study was supported in part by the US National Science Foun- 
dation under grant CNS-0845896, CNS-0845671, and Army Research 
Office grant W911NF-09-1-0089. 
is actually an in-vehicle sensor network: the tire pressure
monitoring system (TPMS). The wide deployment
of TPMSs in the United States is an outgrowth of the 
TREAD Act [35] resulting from the Ford-Firestone tire 
failure controversy [17]. Beyond preventing tire fail- 
ure, alerting drivers about underinflated tires promises 
to increase overall road safety and fuel economy because 
proper tire inflation improves traction, braking distances, 
and tire rolling resistance. These benefits have recently 
led to similar legislation in the European Union [7] which 
mandates TPMSs on all new vehicles starting in 2012. 


Tire Pressure Monitoring Systems continuously mea- 
sure air pressure inside all tires of passenger cars, trucks, 
and multipurpose passenger vehicles, and alert drivers if 
any tire is significantly underinflated. While both direct 
and indirect measurement technologies exist, only direct 
measurement has the measurement sensitivity required 
by the TREAD Act and is thus the only one in produc- 
tion. A direct measurement system uses battery-powered 
pressure sensors inside each tire to measure tire pres- 
sure and can typically detect any loss greater than 1.45 
psi [40]. Since a wired connection from a rotating tire 
to the vehicle’s electronic control unit is difficult to implement,
the sensor module communicates its data via a
radio frequency (RF) transmitter. The receiving tire pres- 
sure control unit, in turn, analyzes the data and can send 
results or commands to the central car computer over 
the Controller-area Network (CAN) to trigger a warning 
message on the vehicle dashboard, for example. Indirect 
measurement systems infer pressure differences between 
tires from differences in the rotational speed, which can 
be measured using the anti-lock braking system (ABS) 
sensors. A lower-pressure tire has to rotate faster to travel 
the same distance as a higher-pressure tire. The disad- 
vantages of this approach are that it is less accurate, re- 
quires calibration by the driver, and cannot detect the si- 
multaneous loss of pressure from all tires (for example, 
due to temperature changes). While initial versions of the 
TREAD Act allowed indirect technology, updated rulings
by the United States National Highway Traffic
Safety Administration (NHTSA) have required all
new cars sold or manufactured after 2008 in the United 
States to be equipped with direct TPMS [35] due to these 
disadvantages. 
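
The principle behind indirect measurement, and its blind spot for simultaneous pressure loss, can be illustrated with a short calculation. This is a sketch under stated assumptions: the rolling radii are hypothetical values chosen for illustration, and real systems work with ABS wheel-speed signals, not radii.

```python
import math


def revolutions(distance_m, rolling_radius_m):
    """Number of wheel revolutions needed to cover a distance."""
    return distance_m / (2 * math.pi * rolling_radius_m)


def relative_spread(radii_m, distance_m=1000.0):
    """Relative difference between the fastest and slowest wheel."""
    counts = [revolutions(distance_m, r) for r in radii_m]
    return (max(counts) - min(counts)) / min(counts)


# Hypothetical radii: one tire has lost pressure and rolls 1 % smaller.
one_low = relative_spread([0.300, 0.300, 0.300, 0.297])
# All four tires shrink by the same amount, e.g. on a cold day.
all_low = relative_spread([0.297, 0.297, 0.297, 0.297])
print(f"one tire low: {one_low:.4f}, all tires low: {all_low:.4f}")
```

Because only differences between wheels are observable, the second case produces no signal at all, which matches the disadvantage noted above.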


1.1 Security and Privacy Risks 


Security and privacy aspects of vehicle-to-vehicle and 
vehicle-to-infrastructure communication have received 
significant consideration by both practitioners and re- 
searchers [3,36]. However, the already deployed in-car 
sensor communication systems have received little at- 
tention, because (i) the short communication range and 
metal vehicle body may render eavesdropping and spoofing
attacks difficult and (ii) tire pressure information appears
to be relatively innocuous. While we agree that
the safety-critical application scenarios for vehicle-to- 
vehicle communications face higher security and privacy 
risks, we believe that even current tire pressure measure- 
ment systems present potential for misuse. 

First, wireless devices are known to present tracking 
risks through explicit identifiers in protocols [20] or iden- 
tifiable patterns in waveforms [10]. Since automobiles 
have become an essential element of our social fabric — 
they allow us to commute to and from work; they help us 
take care of errands like shopping and taking our children 
to day care — tracking automobiles presents substantial 
risks to location privacy. There is significant interest in 
wireless tracking of cars, at least for traffic monitoring 
purposes. Several entities are using mobile toll tag read- 
ers [4] to monitor traffic flows. Tracking through the 
TPMS system, if possible, would raise greater concerns 
because the use of TPMS is not voluntary and they are 
hard to deactivate. 

Second, wireless is easier to jam or spoof because no 
physical connection is necessary. While spoofing a low 
tire pressure reading does not appear to be critical at
first, it will lead to a dashboard warning and will likely 
cause the driver to pull over and inspect the tire. This 
presents ample opportunities for mischief and criminal 
activities, if past experience is any indication. Drivers 
have been willing to tinker with traffic light timing to re- 
duce their commute time [6]. It has also been reported 
that highway robbers make drivers pull over by punc- 
turing the car tires [23] or by simply signaling a driver 
that a tire problem exists. If nothing else, repeated false 
alarms will undermine drivers’ faith in the system and 
lead them to ignore subsequent TPMS-related warnings, 
thereby making the TPMS ineffective.

To what extent these risks apply to TPMS and more 
generally to in-car sensor systems remains unknown. A 
key question to judge these risks is whether the range 
at which messages can be overheard or spoofed is large 
enough to make such attacks feasible from outside the
vehicle. While similar range questions have recently 
been investigated for RFID devices [27], the radio prop- 
agation environment within an automobile is different 
enough to warrant study because the metal body of a car 
could shield RF from escaping or entering a car. It is also 
unclear whether the TPMS message rate is high enough 
to make tracking vehicles feasible. This paper aims to 
fill this void, and presents a security and privacy analysis 
of state-of-the-art commercial tire pressure monitoring
systems, as well as detailed measurements for the com- 
munication range for in-car sensor transmissions. 


1.2 Contributions 


Following our experimental analysis of two popular 
TPMSs used in a large fraction of vehicles in the United 
States, this paper presents the following contributions: 


Lack of security measures. TPMS communications 
are based on standard modulation schemes and 
simple protocols. Since the protocols do not rely 
on cryptographic mechanisms, the communica- 
tion can be reverse-engineered, as we did using 
GNU Radio [2] in conjunction with the Universal 
Software Radio Peripheral (USRP) [1], a low-cost 
public software radio platform. Moreover, the 
implementation of the in-car system appears to 
fully trust all received messages. We found no 
evidence of basic security practices, such as input 
validation, being followed. Therefore, spoofing 
attacks and battery drain attacks are made possible 
and can cause TPMS to malfunction. 


Significant communication range. While the vehicle’s 
metal body does shield the signal, we found a larger 
than expected eavesdropping range. TPMS messages
can be correctly received up to 10 m from the
car with a cheap antenna and up to 40 m with a basic
low noise amplifier. This means an adversary
can overhear or spoof transmissions from the road- 
side or possibly from a nearby vehicle, and thus the 
transmission powers being used are not low enough 
to justify the lack of other security measures. 


Vehicle tracking. Each in-tire sensor module contains a 
32-bit immutable identifier in every message. The 
length of the identifier field renders tire sensor mod- 
ule IDs sufficiently unique to track cars. Although 
tracking vehicles is possible through vision-based 
automatic license plate identification, or through 
toll tag or other wireless car components, track- 
ing through TPMS identifiers raises new concerns, 
because these transmitters are difficult for drivers 
to deactivate as they are available in all new cars 
and because wireless tracking is a low-cost solution
compared to employing vision technology. 


Defenses. We discuss security mechanisms that are ap- 
plicable to this low-power in-car sensor scenario 
without taking away the ease of operation when in- 
stalling a new tire. The mechanisms include rela- 
tively straightforward design changes in addition to 
recommendations for cryptographic protocols that 
will significantly mitigate TPMS security risks.
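
To relate the 10 m and 40 m eavesdropping ranges reported above to a passing vehicle, the following sketch estimates how long a car stays within range of a fixed roadside receiver. The straight-line path, the constant speed, and the message interval parameter are assumptions made for illustration; the function names are hypothetical.

```python
import math


def time_in_range(range_m, offset_m, speed_kmh):
    """Seconds a passing car spends within radio range of a fixed receiver.

    range_m   -- eavesdropping range of the receiver
    offset_m  -- perpendicular distance between receiver and the car's path
    speed_kmh -- constant vehicle speed
    """
    if offset_m >= range_m:
        return 0.0
    # Length of the straight path segment that lies inside the range circle.
    chord_m = 2 * math.sqrt(range_m ** 2 - offset_m ** 2)
    return chord_m / (speed_kmh / 3.6)


def expected_packets(range_m, offset_m, speed_kmh, interval_s):
    """Average number of packets sent while in range (interval is assumed)."""
    return time_in_range(range_m, offset_m, speed_kmh) / interval_s


for range_m in (10.0, 40.0):
    print(range_m, round(time_in_range(range_m, 0.0, 100.0), 2))
```

At 100 km/h the window grows from well under a second at 10 m to a few seconds at 40 m, so whether a roadside unit overhears a packet depends strongly on how often the sensors transmit.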


The insights obtained can benefit the design of other 
emerging wireless in-car sensing systems. Modern au- 
tomobiles contain roughly three miles of wire [31], and 
this will only increase as we make our motor vehicles 
more intelligent through more on-board electronic com- 
ponents, ranging from navigation systems to entertain- 
ment systems to in-car sensors. Increasing the amount 
of wires directly affects car weight and wire complex- 
ity, which decreases fuel economy [13] and imposes dif- 
ficulties on fault diagnosis [31]. For this reason, wire- 
less technologies will increasingly be used in and around 
the car to collect control/status data of the car’s electron- 
ics [16,33]. Thus, understanding and addressing the vul- 
nerabilities associated with internal automotive commu- 
nications, and TPMS in particular, is essential to ensur- 
ing that the new wave of intelligent automotive applica- 
tions will be safely deployed within our cars. 


1.3 Outline 


We begin in Section 2 by presenting an overview of 
TPMS and raising related security and privacy con- 
cerns. Although the specifics of the TPMS communi- 
cation protocols are proprietary, we present our reverse- 
engineering effort that reveals the details of the protocols 
in Section 3. Then, we discuss our study on the sus- 
ceptibility of TPMS to eavesdropping in Section 4 and 
message spoofing attacks in Section 5. After complet- 
ing our security and privacy analysis, we recommend de- 
fense mechanisms to secure TPMS in Section 6. Finally, 
we wrap up our paper by presenting related work in Sec- 
tion 7 before concluding in Section 8. 


2 TPMS Overview and Goals 


TPMS architecture. A typical direct TPMS contains 
the following components: TPM sensors fitted into the 
back of the valve stem of each tire, a TPM electric con- 
trol unit (ECU), a receiving unit (either integrated with 
the ECU or stand-alone), a dashboard TPM warning 
light, and one or four antennas connected to the receiving 
unit. The TPM sensors periodically broadcast the pres- 
sure and temperature measurements together with their 
Figure 1: TPMS architecture with four antennas.


identifiers. The TPM ECU/receiver receives the pack- 
ets and performs the following operations before send- 
ing messages to the TPM warning light. First, since it 
can receive packets from sensors belonging to neighbor- 
ing cars, it filters out those packets. Second, it performs 
temperature compensation, where it normalizes the pres- 
sure readings and evaluates tire pressure changes. The 
exact design of the system differs across suppliers, par- 
ticularly in terms of antenna configuration and commu- 
nication protocols. A four-antenna configuration is nor- 
mally used in high-end car models, whereby an antenna 
is mounted in each wheel housing behind the wheel arch 
shell and connected to a receiving unit through high fre- 
quency antenna cables, as depicted in Figure 1. The four- 
antenna system prolongs sensor battery life, since the an- 
tennas are mounted close to the TPM sensors which re- 
duces the required sensor transmission power. However, 
to reduce automobile cost, the majority of car manufacturers
use one antenna, which is typically mounted on the
rear window [11,39]. 
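
The two receiver-side operations described above, filtering out packets from neighboring cars and temperature compensation, might look roughly as follows. This is a sketch, not the behavior of any actual ECU: the packet fields, the reference temperature, and the use of the constant-volume ideal gas relation for normalization are all assumptions.

```python
from dataclasses import dataclass

REFERENCE_TEMPERATURE_C = 20.0  # assumed normalization temperature


@dataclass
class Packet:
    sensor_id: int
    pressure_kpa: float    # absolute pressure (assumed field)
    temperature_c: float   # air temperature inside the tire (assumed field)


def filter_own_sensors(packets, registered_ids):
    """Drop packets whose sensor ID was not registered with this vehicle."""
    return [p for p in packets if p.sensor_id in registered_ids]


def compensate(packet, reference_c=REFERENCE_TEMPERATURE_C):
    """Scale the pressure to the reference temperature (ideal gas, fixed volume)."""
    kelvin_offset = 273.15
    return (packet.pressure_kpa * (reference_c + kelvin_offset)
            / (packet.temperature_c + kelvin_offset))


registered = {0x1A2B3C4D}                     # hypothetical registered ID
received = [
    Packet(0x1A2B3C4D, 340.0, 60.0),          # own sensor, warm tire
    Packet(0x0BADCAFE, 310.0, 25.0),          # sensor of a neighboring car
]
own = filter_own_sensors(received, registered)
normalized = [compensate(p) for p in own]
print(own, normalized)
```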

Communication protocols. The communications protocols
used between sensors and TPM ECUs are proprietary.
From supplier websites and marketing materials,
however, one learns that TPMS data transmissions com- 
monly use the 315 MHz or 433 MHz bands (UHF) and 
ASK (Amplitude Shift Keying) or FSK (Frequency Shift 
Keying) modulation. Each tire pressure sensor carries an 
identifier (ID). Before the TPMS ECU can accept data 
reported by tire pressure sensors, IDs of the sensor and 
the position of the wheel that it is mounted on have to be 
entered into the TPMS ECU, either manually in most cars
or automatically in some high-end cars. This is typically 
done during tire installation. Afterwards, the ID of the 
sensor becomes the key information that assists the ECU 
in determining the origin of the data packet and filtering 
out packets transmitted by other vehicles. 

To prolong battery life, tire pressure sensors are de- 
signed to sleep most of the time and wake up in two sce- 
narios: (1) when the car starts to travel at high speeds 
(over 40 km/h), the sensors are required to monitor tire 
pressures; (2) during diagnosis and the initial sensor
ID binding phases, the sensors are required to transmit 
their IDs or other information to facilitate the procedures. 
Thus, the tire pressure sensors will wake up in response 
to two triggering mechanisms: a speed higher than 40 
km/h detected by an on-board accelerometer or an RF 
activation signal. 

The RF activation signals operate at 125 kHz in the 
low frequency (LF) radio frequency band and can only 
wake up sensors within a short range, due to the gener- 
ally poor characteristics of RF antennas at that low fre- 
quency. According to manuals from different tire sen- 
sor manufacturers, the activation signal can be either a 
tone or a modulated signal. In either case, the LF re- 
ceiver on the tire sensor filters the incoming activation 
signal and wakes up the sensor only when a matching 
signal is recognized. Activation signals are mainly used 
by car dealers to install and diagnose tire sensors, and are 
manufacturer-specific. 
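
The wake-up behavior described in this section can be summarized in a few lines. This is a simplified model only: representing an activation signal as a string label, and the label itself, are hypothetical choices, while the 40 km/h threshold is the value stated above.

```python
SPEED_THRESHOLD_KMH = 40.0  # wake-up speed stated in the text


def should_wake(speed_kmh, lf_signal=None, expected_signal="TONE-125KHZ"):
    """Return True if the sensor should leave its sleep state.

    speed_kmh       -- speed inferred from the on-board accelerometer
    lf_signal       -- label of a received LF activation signal, or None
    expected_signal -- the manufacturer-specific signal this sensor accepts
                       (hypothetical label)
    """
    moving_fast = speed_kmh > SPEED_THRESHOLD_KMH
    activated = lf_signal is not None and lf_signal == expected_signal
    return moving_fast or activated


print(should_wake(60.0))                    # driving: wakes up
print(should_wake(0.0, "TONE-125KHZ"))      # matching activation: wakes up
print(should_wake(0.0, "OTHER-VENDOR"))     # non-matching signal: stays asleep
```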


2.1 Security and Privacy Analysis Goals 


Our analysis will concentrate on tracking risks through 
eavesdropping on sensor identifiers and on message 
spoofing risks to insert forged data in the vehicle ECU. 
The presence of an identifier raises the specter of lo- 
cation privacy concerns. If the sensor IDs were cap- 
tured at roadside tracking points and stored in databases, 
third parties could infer or prove that the driver has vis- 
ited potentially sensitive locations such as medical clin- 
ics, political meetings, or nightclubs. A similar example 
is seen with electronic toll records that are captured at 
highway entry and exit points by private entities for traf- 
fic monitoring purposes. In some states, these records 
are frequently subpoenaed for civil lawsuits. If tracking 
through the tire pressure monitoring system were pos- 
sible, this would create additional concerns, particularly 
because the system will soon be present in all cars and 
cannot easily be deactivated by a driver. 

Besides these privacy risks, we will consider attacks 
where an adversary interferes with the normal operations 
of TPMS by actively injecting forged messages. For in- 
stance, an adversary could attempt to send a low pressure 
packet to trigger a low pressure warning. Alternatively, 
the adversary could cycle through a few forged low pres- 
sure packets and a few normal pressure packets, causing 
the low pressure warning lights to turn on and off. Such 
attacks, if possible, could undermine drivers’ faith in the 
system and potentially lead them to ignore TPMS-related 
warnings completely. Last but not least, since the TPM 
sensors always respond to the corresponding activation 
signal, an adversary that continuously transmits activa- 
tion signals can force the tire sensors to send packets 
constantly, greatly reducing the lifetime of TPMS. 
To evaluate the privacy and security risks of such a
system, we will address the issues listed below in the 
following sections. 


Difficulty of reverse engineering. Many potential attackers
are unlikely to have access to insider information
and must therefore reconstruct the protocols,
both to be able to extract IDs to track vehicles
and to spoof messages. The level of information 
necessary differs among attacks; replays for exam- 
ple might only require knowledge of the frequency 
band but more sophisticated spoofing requires pro- 
tocol details. For spoofing attacks we also consider 
whether off-the-shelf radios can generate and trans- 
mit the packets appropriately. 


Identifier characteristics. Tracking requires observing 
identifying characteristics from a message, so that 
multiple messages can be linked to the same vehi- 
cle. The success of tracking is closely tied to the 
answers to: (1) Are the sensor IDs used temporar- 
ily or over long time intervals? (2) Does the length 
of the sensor ID suffice to uniquely identify a car? 
Since the sensor IDs are meant to primarily identify 
their positions in the car, they may not be globally 
unique and may render tracking difficult. 


Transmission range and frequency. Tracking further 
depends on whether a road-side tracking unit will be 
likely to overhear a transmission from a car passing 
at high speed. This requires understanding the range 
and messaging frequency of packet transmissions. 
To avoid interference between cars and to prolong 
the battery life, the transmission powers of the sen- 
sors are deliberately chosen to be low. Is it possible 
to track vehicles with such low transmission power 
combined with low messaging frequency? 


Security measures. The ease of message spoofing de- 
pends on the use of security measures in TPMSs. 
The key questions to make message spoofing a practical
threat include: (1) Are messages authenticated?
(2) Does the vehicle use consistency checks
and filtering mechanisms to reject suspicious pack- 
ets? (3) How long, if possible, does it take the ECU 
to completely recover from a spoofing attack? 
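
Whether the length of a sensor ID suffices for tracking can be approached with a simple estimate. The sketch below assumes that IDs are drawn uniformly and independently at random, which nothing in this section guarantees, and uses hypothetical population sizes.

```python
import math

ID_BITS = 32  # identifier length considered in this paper


def collision_probability(population, id_bits=ID_BITS):
    """Probability that at least one other sensor shares a given sensor's ID.

    Assumes uniformly random, independent IDs (an idealization).
    """
    if population < 2:
        return 0.0
    per_sensor = 1.0 / 2 ** id_bits
    # 1 - (1 - p)^(n - 1), computed in a numerically stable way.
    return -math.expm1((population - 1) * math.log1p(-per_sensor))


for population in (10 ** 6, 10 ** 9):     # hypothetical sensor populations
    print(population, collision_probability(population))
```

Under this idealization a single ID collides only rarely in a regional population, and a vehicle is further distinguished by the combination of all its sensor IDs.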


3 Reverse Engineering TPMS Communi- 
cation Protocols 


Analyzing security and privacy risks begins with obtain- 
ing a thorough comprehension of the protocols for spe- 
cific sensor systems. To elaborate, one needs to know 
the modulation schemes, encoding schemes, and mes- 
sage formats, in addition to the activation and reporting 
Figure 2: Equipment used for packet sniffing. At the bottom,
from left to right are the ATEQ VT55 TPMS trigger tool, two 
tire pressure sensors (TPS-A and TPS-B), and a low noise am- 
plifier (LNA). At the top is one laptop connected with a USRP 
with a TVRX daughterboard attached. 


methodologies to properly decode or spoof sensor mes- 
sages. Apart from access to an insider or the actual spec- 
ifications, this information requires reverse-engineering 
by an adversary. To convey the level of difficulty of this 
process for in-car sensor protocols, we provide a brief 
walk-through of our approach below, where we begin by 
presenting relevant hardware. 


Tire pressure sensor equipment. We selected two 
representative tire pressure sensors that employ different 
modulation schemes. Both sensors are used in automo- 
biles with high market shares in the US. To prevent mis- 
use of the information here, we refer to these sensors 
simply as tire pressure sensor A (TPS-A) and tire pres- 
sure sensor B (TPS-B). To help our process, we also ac- 
quired a TPMS trigger tool, which is available for a few 
hundred dollars. Such tools are handheld devices that 
can activate and decode information from a variety of 
tire sensor implementations. These tools are commonly 
used by car technicians and mechanics for troubleshoot- 
ing. For our experiments, we used a TPMS trigger tool 
from ATEQ [8] (ATEQ VT55).


Raw signal sniffer. Reverse engineering the TPMS 
protocols requires the capture and analysis of raw sig- 
nal data. For this, we used GNU Radio [2] in con- 
junction with the Universal Software Radio Peripheral 
(USRP) [1]. GNU Radio is an open source, free software 
toolkit that provides a library of signal processing blocks 
that run on a host processing platform. Algorithms implemented
using GNU Radio can receive data directly
from the USRP, which is the hardware that provides RF 
access via an assortment of daughterboards. They in- 
clude the TVRX daughterboard capable of receiving RF 
in the range of 50 MHz to 870 MHz and the LFRX daughterboard
able to receive from DC to 30 MHz. For convenience,
we initially used an Agilent 89600 Vector Signal
Analyzer (VSA) for data capture (but such equipment
is not necessary). The pressure sensor modules, trigger
tool, and software radio platform are shown in Figure 2. 


3.1 Reverse Engineering Walk Through 


While our public domain search resulted in only high- 
level knowledge about the TPM communication proto- 
col specifics, anticipating sensor activity in the 315/433 
MHz bands did provide us with a starting point for our 
reverse engineering analysis. 

We began by collecting a few transmissions from each 
of the TPM sensors. The VSA was used to narrow down 
the spectral bandwidth necessary for fully capturing the 
transmissions. The sensors were placed close to the VSA 
receiving antenna while we used the ATEQ VT55 to trigger
the sensors. Although initial data collections were
obtained using the VSA, the research team switched to 
using the USRP to illustrate that our findings (and subse- 
quently our attacks) can be achieved with low-cost hard- 
ware. An added benefit of using the USRP for the data 
collections is that it is capable of providing synchronized 
collects for the LF and HF frequency bands — thus al- 
lowing us to extract important timing information be- 
tween the activation signals and the sensor responses. To 
perform these collects, the TVRX and LFRX daughter- 
boards were used to provide access to the proper radio 
frequencies. Once the sensor bursts were collected, we 
began our signal analysis in MATLAB to understand the 
modulation and encoding schemes. The final step was to 
map out the message format. 


Determine coarse physical layer characteristics. 
The first phase of characterizing the sensors involved 
measuring burst widths, bandwidth, and other physical 
layer properties. We observed that burst widths were 
on the order of 15 ms. During this initial analysis, we 
noted that each sensor transmitted multiple bursts in re- 
sponse to their respective activation signals. TPS-A used 
4 bursts, while TPS-B responded with 5 bursts. Indi- 
vidual bursts in the series were determined to be exact 
copies of each other, thus each burst encapsulates a com- 
plete sensor report. 
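
The coarse burst analysis can be sketched as a threshold detector over the signal envelope. This is an illustration only, not the MATLAB analysis we performed: the envelope samples, threshold, and gap length below are synthetic, hypothetical values.

```python
def find_bursts(envelope, threshold, min_gap):
    """Return (start, end) sample indices of bursts in an envelope.

    Silent stretches shorter than min_gap samples are treated as part of
    a burst, since an ASK-modulated burst contains low-amplitude symbols.
    """
    bursts = []
    start = None
    last_active = None
    for i, value in enumerate(envelope):
        if value >= threshold:
            if start is None:
                start = i
            last_active = i
        elif start is not None and i - last_active >= min_gap:
            bursts.append((start, last_active + 1))
            start = None
    if start is not None:
        bursts.append((start, last_active + 1))
    return bursts


def bursts_identical(envelope, bursts):
    """Check whether all detected bursts carry the same sample sequence."""
    copies = [envelope[s:e] for s, e in bursts]
    return all(copy == copies[0] for copy in copies)


# Synthetic envelope: four repetitions of the same short burst.
one_burst = [0] * 5 + [1, 0, 1, 1] + [0] * 5
envelope = one_burst * 4
bursts = find_bursts(envelope, threshold=0.5, min_gap=3)
print(bursts, bursts_identical(envelope, bursts))
```

Dividing a burst's length in samples by the sample rate gives its width in seconds, which is how a figure on the order of 15 ms would be read off a real capture.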


Identify the modulation scheme. Analysis of the 
baseband waveforms revealed two distinct modulation 
schemes. TPS-A employed amplitude shift keying 
(ASK), while TPS-B employed a hybrid modulation 
scheme — simultaneous usage of ASK and frequency 
shift keying (FSK). We speculate that the hybrid scheme 
is used for two reasons: (1) to maximize operability with 
TPM readers and (2) to mitigate the effects of an adverse 
channel during normal operation. Figure 3 illustrates the 
differences between the sensors’ transmission in both the 
time and frequency domains. The modulation schemes
are also observable in these plots.

Figure 3: A comparison of FFT and signal strength time series
between TPS-A and TPS-B sensors.


Resolve the encoding scheme. Despite the different 
modulation schemes, it was immediately apparent that 
both sensors were utilizing Manchester encoding (after 
distinct preamble sequences). The baud rate is directly 
observable under Manchester encoding and was on the 
order of 5 kBd. The next step was to determine the bit 
mappings from the Manchester encoded signal. In order 
to accomplish this goal, we leveraged knowledge of a 
known bit sequence in each message. We knew the sen- 
sor ID because it was printed on each sensor and assumed 
that this bit sequence must be contained in the message. 
We found that applying differential Manchester decoding 
generated a bit sequence containing the sensor ID. 
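
The decoding and ID search described above can be sketched in Python. This is a minimal illustration, not the authors' MATLAB decoder. It assumes the burst has already been demodulated and sliced into half-bit levels (two per bit), that the first bit period serves only as a reference, and that a level change at the start of a bit period encodes a 0. That is one common differential Manchester convention; the sensors may use the opposite one, which the hypothetical transition_means parameter selects. The function names and the sample ID bits are made up for the example.

```python
def diff_manchester_encode(bits, transition_means=0, start_level=0):
    """Encode bits as half-bit levels, preceded by one reference period."""
    level = start_level
    half_bits = [level, 1 - level]      # reference period, carries no data
    level = 1 - level
    for bit in bits:
        if bit == transition_means:
            level = 1 - level           # transition at the start of the bit
        half_bits.append(level)
        level = 1 - level               # mandatory mid-bit transition
        half_bits.append(level)
    return half_bits


def diff_manchester_decode(half_bits, transition_means=0):
    """Decode half-bit levels (two per bit) back into bits."""
    if len(half_bits) < 2 or len(half_bits) % 2:
        raise ValueError("expected an even, non-zero number of half-bit levels")
    bits = []
    prev_level = half_bits[1]           # level at the end of the reference period
    for i in range(2, len(half_bits), 2):
        first, second = half_bits[i], half_bits[i + 1]
        if first == second:
            raise ValueError(f"missing mid-bit transition at half-bit {i}")
        changed = first != prev_level
        bits.append(transition_means if changed else 1 - transition_means)
        prev_level = second
    return bits


def find_subsequence(bits, pattern):
    """Return every offset at which pattern occurs in bits."""
    n = len(pattern)
    return [i for i in range(len(bits) - n + 1) if bits[i:i + n] == pattern]


known_id = [1, 0, 1, 1, 0, 0, 1, 0]     # made-up ID bits
message = [0, 1] + known_id + [1, 1]    # made-up surrounding fields
levels = diff_manchester_encode(message)
decoded = diff_manchester_decode(levels)
print(find_subsequence(decoded, known_id))
```

Because only level changes carry information, the decoder gives the same bits when every level is inverted, which is convenient when the polarity of the demodulated signal is unknown.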


Reconstructing the message format. While both 
sensors used differential Manchester encoding, their 
packet formats differed significantly. Thus, our next step 
was to determine the message mappings for the rest of 
the bits for each sensor. To understand the size and mean- 
ing of each bitfield, we manipulated sensor transmissions 
by varying a single parameter and observed which bits 
changed in the message. For instance, we adjusted the 
temperature using heat guns and refrigerators, or adjusted
the pressure. By simultaneously using the ATEQ VT55, 
we were also able to observe the actual transmitted val- 
ues and correlate them with our decoded bits. Using this 
approach, we managed to determine the majority of mes- 
sage fields and their meanings for both TPS-A and TPS- 
B. These included temperature, pressure, and sensor ID, 
as illustrated in Figure 4. We also identified the use of 
a CRC checksum and determined the CRC polynomials 
through a brute force search. 
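
A brute-force polynomial search of this kind can be sketched as follows. The text does not give the CRC width or conventions, so this example assumes an 8-bit CRC computed MSB-first with a zero initial value, no bit reflection, and no final XOR. The helper names, the payload bytes, and the polynomial 0x07 are invented purely to produce test data.

```python
def crc8(data: bytes, poly: int, init: int = 0) -> int:
    """Bitwise MSB-first CRC-8 with no reflection and no final XOR."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def find_crc8_polys(samples):
    """Return all polynomials consistent with (payload, observed_crc) pairs."""
    return [poly for poly in range(1, 256)
            if all(crc8(payload, poly) == observed for payload, observed in samples)]


TRUE_POLY = 0x07  # arbitrary polynomial, used only to fabricate sample data
payloads = [bytes([0x01]), bytes([0x12, 0x34, 0x56]), bytes([0xDE, 0xAD, 0xBE, 0xEF])]
samples = [(p, crc8(p, TRUE_POLY)) for p in payloads]
print([hex(p) for p in find_crc8_polys(samples)])
```

In practice the search space also includes the initial value, bit order, and final XOR, and the captured packets from the sensors play the role of the samples.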

At this point, we did not yet understand the meaning
of a few bits in the message. We were later able to
reconstruct these by generating messages with our software
radio, changing these bits, and observing the output of the
TPMS tool or a real car. It turned out that these were
parameters like battery status, over which we had no direct
control by purely manipulating the sensor module. More
details on message spoofing are presented in Section 5.

Figure 4: An illustration of a packet format. Note the size is
not proportional to real packet fields.


3.2 Lessons Learned 


The aforementioned reverse engineering can be accomplished
with a reasonable background in communications
and computer engineering. It took a few days for
a PhD-level engineer experienced with reverse engineering
to build an initial system. It took several weeks for an
MS-level student with no prior experience in reverse
engineering and GNU Radio programming to understand
and reproduce the attack. The equipment used (the
ATEQ VT55 and a USRP with a TVRX daughterboard) is openly
available and costs $1500 at current market prices.

Perhaps one of the most difficult issues involved baud 
rate estimation. Since Manchester encoding is used, our 
initial baud rate estimates involved averaging the gaps 
between the transition edges of the signal. However, the 
jitter (most likely associated with the local oscillators of 
the sensors) makes it almost impossible to estimate a 
baud rate accurate enough for a simple software-based 
decoder to work correctly. To address this problem, we 
modified our decoders to be self-adjustable to compen- 
sate for the estimation errors throughout the burst. 
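
One way such a self-adjusting decoder could work is sketched below. This is an assumption about the approach, not the authors' implementation: each gap between signal edges is classified as one or two half-bit periods, and the half-period estimate is updated after every gap with exponential smoothing. The function name, the smoothing factor alpha, and the synthetic drift are hypothetical, and time is in arbitrary units.

```python
def classify_edge_gaps(gaps, initial_half_period, alpha=0.2):
    """Label each edge-to-edge gap as 1 or 2 half-bit periods.

    The half-period estimate follows the signal, so slow clock drift is
    tracked instead of accumulating over the burst. alpha=0 disables tracking.
    """
    half = float(initial_half_period)
    labels = []
    for gap in gaps:
        n = 1 if gap < 1.5 * half else 2
        labels.append(n)
        half = (1 - alpha) * half + alpha * (gap / n)
    return labels, half


# Synthetic burst whose half-bit period drifts from 100 to about 159 units.
pattern = [1, 1, 2, 1, 2, 2, 1, 1, 2, 1] * 5
true_half = [100 + 1.2 * i for i in range(len(pattern))]
gaps = [n * h for n, h in zip(pattern, true_half)]

adaptive, final_half = classify_edge_gaps(gaps, initial_half_period=100)
fixed, _ = classify_edge_gaps(gaps, initial_half_period=100, alpha=0.0)
print(adaptive == pattern, fixed == pattern)
```

With the fixed estimate, late single-period gaps grow past the decision threshold and are mislabeled, while the tracking version keeps classifying them correctly.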

The reverse engineering revealed the following obser- 
vations. First, it is evident that encryption has not been 
used—which makes the system vulnerable to various at- 
tacks. Second, each message contains a 28-bit or 32-bit 
sensor ID depending on the type of sensor. Regardless 
of the sensor type, the IDs do not change during the sen- 
sors’ lifetimes. 

Given that there are 254.4 million registered passenger
vehicles in the United States [34], one 28-bit sensor ID is
enough to track each registered car. Even in the future,
when the number of cars may exceed 256 million, we
can still identify a car using a collection of tire IDs, i.e.,
a 4-tuple of tire IDs. Assuming a uniform distribution
across the 28-bit ID space, the probability of an exact
match of two cars’ IDs is $4!/2^{112}$ without considering
the ordering. Determining how many cars $R$ can be on
the road in the US with a guarantee that there is a less
than $P$ chance of any two or more cars having the same
ID-set is a classical birthday problem calculation.

Figure 5: Block chart of the live decoder/eavesdropper.


To achieve a match rate of larger than P = 1%, more
than $10^{15}$ cars need to be on the road, which is
significantly more than 1 billion cars. This calculation, of
course, is predicated on the assumption of a uniform
allocation across the 28-bit ID space. Even if we relax this
assumption and assume 20 bits of entropy in a single
28-bit ID space, we would still need roughly 38 billion cars
in the US to get a match rate of more than P = 1%.
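
The order of magnitude of these figures can be checked with the standard birthday approximation. The sketch below is not the authors' formula, which is not reproduced in this text. It assumes that the number of distinct unordered ID sets is roughly $2^{4b}/4!$ for $b$ bits of entropy per ID and that $P \approx 1 - e^{-R^2/(2N)}$. The function name is hypothetical.

```python
import math


def cars_for_collision(id_bits, tires=4, p=0.01):
    """Approximate fleet size R at which P(some two cars share an ID set) reaches p."""
    # Roughly 2^(bits * tires) / tires! distinct unordered ID sets.
    n_sets = 2 ** (id_bits * tires) / math.factorial(tires)
    # Birthday approximation: p ~ 1 - exp(-R^2 / (2 N)), solved for R.
    return math.sqrt(2 * n_sets * math.log(1 / (1 - p)))


for bits in (28, 20):
    print(bits, f"{cars_for_collision(bits):.2e}")
```

Under these assumptions the result is on the order of $10^{15}$ cars for 28 bits and a few times $10^{10}$ cars for 20 bits of entropy, consistent in magnitude with the numbers quoted above; the exact values depend on the approximation used.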

We note that this calculation is based on the unrealis- 
tic assumption that all 38 billion cars are co-located, and 
are using the same modulation and coding schemes. Ul- 
timately, it is very unlikely to have two cars that would 
be falsely mistaken for each other. 


4 Feasibility of Eavesdropping 


A critical question for evaluating privacy implications of 
in-car wireless networks is whether the transmissions can 
be easily overheard from outside the vehicle body. While 
tire pressure data does not require strong confidentiality, 
the TPMS protocols contain identifiers that can be used 
to track the locations of a device. In practice, the proba- 
bility that a transmission can be observed by a stationary 
receiver depends not only on the communication range 
but also on the messaging frequency and speed of the 
vehicle under observation, because these factors affect 
whether a transmission occurs in communication range. 

The transmission power of pressure sensors is rela- 
tively small to prolong sensor battery lifetime and reduce 
cross-interference. Additionally, the NHTSA requires 
tire pressure sensors to transmit data only once every 60 
seconds to 90 seconds. The low transmission power, low 
data report rate, and high travel speeds of automobiles 
raise questions about the feasibility of eavesdropping. 

In this section, we experimentally evaluate the range 
of TPMS communications and further evaluate the feasi- 
bility of tracking. This range study will use TPS-A sen- 
sors, since their TPMS uses a four-antenna structure and 
operates at a lower transmission power. It should there- 
fore be more difficult to overhear. 


4.1 Eavesdropping System 


During the reverse engineering steps, we developed
two MATLAB decoders: one for decoding ASK-modulated
TPS-A and the other for decoding FSK-modulated
TPS-B. In order to reuse our decoders yet
be able to constantly monitor the channel and only
record useful data using GNU Radio together with the
USRP, we created a live decoder/eavesdropper leveraging
pipes. We used the GNU Radio standard Python
script usrp_rx_cfile.py to sample channels at a rate
of 250 kHz, and the recorded data was then piped to a
packet detector. Once the packet detector identifies high
energy in the channel, it extracts the complete packet and
passes the corresponding data to the decoder to extract
the pressure, temperature, and the sensor ID. If decoding
is successful, the sensor ID is output to the screen
and the raw packet signal along with the time stamp is
stored for later analysis. To be able to capture data
from multiple different TPMS systems, the eavesdropping
system would also need a modulation classifier to
recognize the modulation scheme and choose the
corresponding decoder. For example, Liedtke’s [29] algorithm
could be used to differentiate ASK2 and FSK2. Such an
eavesdropping system is depicted in Fig. 5.
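
The packet detector stage can be illustrated with a simple energy threshold. This is a hypothetical stand-in written in plain Python, not the authors' detector: it assumes the samples are already available as a list, averages power over fixed blocks, and keeps only bursts longer than a minimum length. The function name, block size, and threshold values are made up. At the 250 kHz sample rate mentioned above, a 15 ms burst would span about 3750 samples.

```python
def detect_bursts(samples, threshold, min_len, window=8):
    """Return (start, end) sample indices of bursts whose block power exceeds threshold."""
    bursts = []
    start = None
    for i in range(0, len(samples) - window + 1, window):
        block = samples[i:i + window]
        power = sum(abs(s) ** 2 for s in block) / window
        if power > threshold and start is None:
            start = i                       # rising edge of a burst
        elif power <= threshold and start is not None:
            if i - start >= min_len:        # ignore short spikes
                bursts.append((start, i))
            start = None
    if start is not None and len(samples) - start >= min_len:
        bursts.append((start, len(samples)))
    return bursts


# Synthetic capture: noise, a long burst, noise, a short spike, noise.
capture = [0.01] * 64 + [1.0] * 160 + [0.01] * 64 + [1.0] * 8 + [0.01] * 32
print(detect_bursts(capture, threshold=0.1, min_len=32))
```

Each detected range would then be handed to the decoder, and ranges that fail decoding would be discarded.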

In early experiments, we observed that the decoding
script generated a large amount of erroneous data caused by
interference and artifacts of the dynamic channel environment.
To address this problem, we made the script more robust and added
a filter to discard erroneous data. This filter drops all
signals that do not match TPS-A or TPS-B. We have
tested our live decoder on the interstate highway I-26
(Columbia, South Carolina) with two cars running in
parallel at speeds exceeding 110 km/h.


4.2 Eavesdropping Range 


We measured the eavesdropping range in both indoor and
outdoor scenarios by having the ATEQ VT55 trigger the
sensors. In both scenarios, we fixed the location of the
USRP at the origin (0, 0) in Figure 7 and moved the
sensor along the y-axis. In the indoor environment, we
studied the reception range of stand-alone sensors in a
hallway. In the outdoor environment, we drove one of
the authors’ cars around to measure the reception range
of the sensors mounted in its front left wheel while the
car’s body was parallel to the x-axis, as shown in
Figure 7. In our experiment, we noticed that we were able
to decode the packets when the received signal strength was
larger than the ambient noise floor. The resulting signal
strength over the area where packets could be decoded
successfully and the ambient noise floors are depicted
in Figure 6 (a). The results show that both the outdoor
and indoor eavesdropping ranges are roughly 10.7 m; the
vehicle body appears to have only a minor attenuation
effect with regard to a receiver positioned broadside.

Figure 6: Comparison of eavesdropping range of TPS-A.

We next performed the same set of range experiments 
while installing a low noise amplifier (LNA) between the 
antenna and the USRP radio front end, as shown in Fig- 
ure 2. As indicated in Figure 6, the signal strength of 
the sensor transmissions still decreased with distance and 
the noise floor was raised because of the LNA, but the 
LNA amplified the received signal strength and improved 
the decoding range from 10.7 meters to 40 meters. This 
shows that with some inexpensive hardware a significant 
eavesdropping range can be achieved, a range that allows 
signals to be easily observed from the roadside. 

Note that other ways to boost receiving range exist. 
Examples include the use of directional antennas or more 
sensitive omnidirectional antennas. We refer readers to 
the antenna studies in [9, 15, 42] for further information.


4.3 Eavesdropping Angle Study


We now investigate whether the car body has a larger 
attenuation effect if the receiver is located at different 
angular positions. We also study whether one USRP is 
enough to sniff packets from all four tire sensors. 


The effect of car body. In our first set of experiments,
we studied the effect of the car’s metallic body on signal
attenuation to determine the number of required USRPs.
We placed the USRP antenna at the origin of the coordinate
system, as shown in Figure 7, and positioned the car at several
points on the line y = 0.5 with its body parallel to
the x-axis. Eavesdropping at these points revealed that it
is very hard to receive packets from all four tires
simultaneously. A set of received signal strength (RSS)
measurements taken when the front left wheel was located at
(0, 0.5) meters is summarized in Table 1. The results show that
the USRP can receive packets transmitted by the front
left, front right, and rear left sensors, but not from the
rear right sensor, due to the signal degradation caused by
the car’s metallic body. Thus, to ensure receiving packets
from all four sensors, at least two observation spots
may be required, one on each side of the
car. For instance, two USRPs can be placed at different
spots, or two antennas connected to the same USRP can
be placed meters apart.


The eavesdropping angle at various distances. We 
studied the range associated with one USRP receiving 
packets transmitted by the front left wheel. Again, we 
placed the USRP antenna at the origin and recorded 
packets when the car moved along trajectories parallel to 
the x-axis, as shown in Figure 7. These trajectories were 
1.5 meters apart. Along each trajectory, we recorded 
RSS at the locations from where the USRP could decode 
packets. The colored region in Figure 11, therefore, de- 
notes the eavesdropping range, and the contours illustrate 
the RSS distribution of the received packets. 


From Figure 11, we observe that the maximum horizontal
eavesdropping range, $r_{max}$, changes as a function
of the distance between the trajectory and the USRP
antenna, $d$. Additionally, the eavesdropping ranges on both
sides of the USRP antenna are asymmetric due to the
car’s metallic body. Without the reflection and impediment
of the car body, the USRP is able to receive the
packets at further distances when the car is approaching
rather than leaving. The numerical results of $r_{max}$,
$\varphi_1$ (the maximum eavesdropping angle when the car is
approaching the USRP), and $\varphi_2$ (the maximum angle when
the car is leaving the USRP) are listed in Figure 8. Since
the widest range of 9.1 meters was observed on the parallel
trajectory 3 meters away from the x-axis, a USRP should be
placed 2.5 meters away from the lane marks to maximize
the chance of packet reception, assuming cars travel 0.5
meters away from the lane marks.

Location    | RSS (dB) | Location   | RSS (dB)
Front left  | -41.8    | Rear left  | -55.0
Front right | -34.4    | Rear right | N/A

Table 1: RSS when USRP is located 0.5 meters away from the
front left wheel.

Figure 7: The experiment setup for the range study.


Messaging rate. According to NHTSA regulations,
TPMS sensors transmit pressure information every 60
to 90 seconds. Our measurements confirmed that both
TPS-A and TPS-B sensors transmit one packet every 60
seconds or so. Interestingly, contrary to documentation
(which states that sensors should report data periodically
once the speed exceeds 40 km/h), both sensors periodically
transmit packets even when cars are stationary. Furthermore,
TPS-B transmits periodic packets even when the
car is not running.


4.4 Lessons Learned: Feasibility of Tracking Automobiles


The surprising range of 40 m makes it possible to capture
a packet and its identifiers from the roadside if the car
is stationary (e.g., at a traffic light or in a parking lot). Given
that a TPMS sensor only sends one message per minute,
tracking becomes difficult at higher speeds. Consider, for
example, a passive tracking system deployed along the
roadside at highway entry and exit ramps, which seeks
to extract the unique sensor ID for each car and link
entry and exit locations as well as subsequent trips. To
ensure capturing at least one packet, a row of sniffers would
be required to cover the stretch of road that takes a car
60 seconds to travel. The number of required sniffers is
$N_{passive} = \lceil v \cdot T / r_{max} \rceil$, where $v$ is the speed of
the vehicle, $T$ is the message report period, and $r_{max}$ is
the detection range of the sniffer. Using the sniffing
system described in previous sections, where $r_{max} = 9.1$
m, 110 sniffers are required to guarantee capturing one
packet transmitted by a car traveling at 60 km/h.
Deploying such a tracking system appears cost-prohibitive.
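
The sniffer count follows directly from the formula. The short Python sketch below reproduces the arithmetic, with the speed converted from km/h to m/s; the function name is hypothetical.

```python
import math


def passive_sniffers(speed_kmh, period_s, range_m):
    """N_passive = ceil(v * T / r_max), with v converted to meters per second."""
    speed_ms = speed_kmh * 1000 / 3600
    return math.ceil(speed_ms * period_s / range_m)


print(passive_sniffers(60, 60, 9.1))   # the case discussed in the text
print(passive_sniffers(110, 60, 9.1))  # a highway-speed case for comparison
```

At 60 km/h a car covers 1000 m in one 60-second report period, and 1000 / 9.1 rounds up to 110, matching the number given above.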

It is possible to track with fewer sniffers, however, by
leveraging the activation signal. The tracking station can
send the 125 kHz activation signal to trigger a transmission
by the sensor. To achieve this, the triggers and sniffers
should be deployed in a way such that they meet
the following requirements regardless of the cars’ travel
speeds: (1) the transmission range of the trigger should
be large enough so that the passing car is able to receive
the complete activation signal; (2) the sniffer should be
placed at a distance from the activation sender so that the
car is in the sniffers’ eavesdropping range when it starts
to transmit; and (3) the car should stay within the
eavesdropping range until it finishes the transmission.

Figure 11: Study the angle of eavesdropping with LNA.

To determine the configuration of the sniffers and the
triggers, we conducted an epitomical study using a USRP
with two daughterboards attached, one recording at 125
kHz and the other recording at 315 MHz. Our results
are depicted in Figure 9 and show that the activation
signal of TPS-B lasts approximately 359 ms. The sensors
start to transmit 530 ms after the beginning of the
activation signal, and the data takes 15 ms to transmit. This
means that, to trigger a car traveling at 60 km/h, the
trigger should have a transmission range of at least 6 meters,
the distance the car covers during the 359 ms activation signal.
Since a sniffer can eavesdrop up to 9.1 meters, it suffices
to place the sniffer right next to the trigger. Additional
sniffers could be placed down the road to capture
packets of cars traveling at higher speeds.
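
The timing argument can be checked with a few lines of Python. The sketch assumes that the car must stay within trigger range for the whole activation signal and within sniffing range until the data burst ends, 545 ms after the activation signal begins. The constant and function names are hypothetical.

```python
ACTIVATION_MS = 359       # duration of the TPS-B activation signal
RESPONSE_START_MS = 530   # sensor starts transmitting after this delay
BURST_MS = 15             # duration of one data burst
SNIFFER_RANGE_M = 9.1     # eavesdropping range used in the text


def distance_travelled_m(speed_kmh, duration_ms):
    """Distance in meters covered at a constant speed during duration_ms."""
    return speed_kmh / 3.6 * duration_ms / 1000


for speed in (60, 110):
    trigger_range = distance_travelled_m(speed, ACTIVATION_MS)
    until_burst_end = distance_travelled_m(speed, RESPONSE_START_MS + BURST_MS)
    print(speed, round(trigger_range, 2), round(until_burst_end, 2),
          until_burst_end <= SNIFFER_RANGE_M)
```

At 60 km/h the car covers just under 6 meters during activation and stays within 9.1 meters until the burst ends, so one co-located sniffer suffices. At 110 km/h it has moved beyond 9.1 meters by then, which is why additional sniffers down the road would be needed.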

To determine the feasibility of this approach, we have 
conducted a roadside experiment using the ATEQ VT55 
which has a transmission range of 0.5 meters. We were 
able to activate and extract the ID of a targeted TPMS 
sensor moving at the speed of 35 km/h using one sniffer. 
We note that ATEQ VT55 was deliberately designed with 
short transmission range to avoid activating multiple cars 
in the dealership. With a different radio frontend, such as 
using a matching antenna for 125 kHz, one can increase 
the transmission range of the trigger easily and enable 
capturing packets from cars at higher speeds. 


Comparison between tracking via TPMS and Automatic
Number Plate Reading. Automatic Number
Plate Reading (ANPR) technologies have been proposed
to track automobiles and leverage License Plate Capture
Cameras (LPCC) to recognize license plate numbers.
Due to the difference between underlying technologies,
TPMS and ANPR systems exhibit different characteristics.
First, ANPR allows for more direct linkage to
individuals through law enforcement databases. ANPR
requires, however, line of sight (LOS), and its accuracy
can be affected by weather conditions (e.g., light or
humidity) or dirt on the plate. In an ideal condition with
excellent modern systems, the read rate for license plates
is approximately 90% [25]. A good quality ANPR camera
can recognize number plates at 10 meters [5]. In
contrast, the ability to eavesdrop on the RF transmission
of TPMS packets does not depend on illumination
or LOS. The probability of identifying the sensor ID is
around 99% when the eavesdropper is placed 2.5 meters
away from the lane marks. Second, the LOS requirement
forces the ANPR to be installed in visible locations.
Thus, a motivated driver can take alternative routes or
remove/cover the license plates to avoid being detected. In
comparison, the use of TPMS is harder to circumvent,
and the ability to eavesdrop without LOS could lead to
more pervasive automobile tracking. Although swapping
or hiding license plates requires less technical sophistication,
it also imposes much higher legal risks than
deactivating TPMS units.

[Figure 8: caption partially lost; surviving text: “... ranges when the car is traveling at various trajectories.”]

Figure 9: Time series of activation and data signals.

[Figure 10: caption partially lost; surviving text: “... with two daughterboards are used to ...”]


5 Feasibility of Packet Spoofing 


Being able to eavesdrop on TPMS communication from 
a distance allows us to further explore the feasibility of 
inserting forged data into safety-critical in-vehicle sys- 
tems. Such a threat presents potentially even greater 
risks than the tracking risks discussed so far. While 
the TPMS is not yet a highly safety-critical system, we 
experimented with spoofing attacks to understand: (1) 
whether the receiver sensitivity of an in-car radio is high 
enough to allow spoofing from outside the vehicle or a 
neighboring vehicle, and (2) security mechanisms and 
practices in such systems. In particular, we were curious 
whether the system uses authentication, input validation, 
or filtering mechanisms to reject suspicious packets. 


The packet spoofing system. Our live eavesdropper
can detect TPMS transmissions and decode both
ASK-modulated TPS-A messages and FSK-modulated
TPS-B messages in real time. Our packet spoofing system is
built on top of our live eavesdropper, as shown in
Figure 12. The Packet Generator takes two sets of parameters
(sensor type and sensor ID from the eavesdropper;
temperature, pressure, and status flags from users) and
generates a properly formulated message. It then
modulates the message at baseband (using ASK or FSK) while
inserting the proper preamble. Finally, the rogue sensor
packets are upconverted and transmitted (either
continuously or just once) at the desired frequency (315/433
MHz) using a customized GNU Radio Python script. We
note that once the sensor ID and sensor type are captured,
we can create and repeatedly transmit the forged message
at a pre-defined period.

[Figure 10, caption continued: “... transmit data packets at 315/433 MHz.”]
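
The baseband ASK step can be illustrated with simple on-off keying. This sketch is an assumption-laden illustration, not the authors' GNU Radio script: the preamble pattern, the payload chips, and the samples-per-chip value are invented, the payload is assumed to be already Manchester-encoded into chips, and FSK is not shown.

```python
def ask_modulate(chips, samples_per_chip, high=1.0, low=0.0):
    """On-off keyed baseband: each chip becomes a run of identical samples."""
    samples = []
    for chip in chips:
        samples.extend([high if chip else low] * samples_per_chip)
    return samples


def build_burst(preamble_chips, payload_chips, samples_per_chip):
    """Prepend the preamble to the payload chips and modulate the result."""
    return ask_modulate(list(preamble_chips) + list(payload_chips), samples_per_chip)


# Hypothetical preamble and payload chips, two samples per chip.
burst = build_burst([1, 1, 1, 0], [1, 0, 0, 1], samples_per_chip=2)
print(len(burst))
```

The resulting sample stream would then be handed to the radio front end for upconversion and transmission.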

At the time of our experimentation, there were no
USRP daughterboards available that were capable of
transmitting at 315/433 MHz. So, we used a frequency
mixing approach where we leveraged two XCVR2450
daughterboards and a frequency mixer (Mini-Circuits
ZLW 11H) as depicted in Fig. 10. By transmitting a tone
out of one XCVR2450 into the LO port of the mixer,
we were able to mix down the spoofed packet from the
other XCVR2450 to the appropriate frequency. For 315
MHz, we used a tone at 5.0 GHz and the spoofed packet
at 5.315 GHz.¹
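
The frequency arithmetic behind this setup is that of an ideal mixer, which produces components at the difference and the sum of its two input frequencies. The sketch below is only a worked example of that arithmetic, with a hypothetical function name and frequencies given in MHz.

```python
def mixer_products_mhz(f_rf_mhz, f_lo_mhz):
    """Return the (difference, sum) products of an ideal mixer, in MHz."""
    return abs(f_rf_mhz - f_lo_mhz), f_rf_mhz + f_lo_mhz


for f_rf in (5315, 5433):
    difference, total = mixer_products_mhz(f_rf, 5000)
    print(f_rf, difference, total)
```

The difference product lands at 315 MHz or 433 MHz as required, while the sum product lies above 10 GHz, far from the TPMS band.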

To validate our system, we decoded spoofed packets
with the TPMS trigger tool. Figure 13 shows a screen
snapshot of the ATEQ VT55 after receiving a spoofed
packet with a sensor ID of “DEADBEEF” and a tire
pressure of 0 PSI. This testing also allowed us to understand
the meaning of the remaining status flags in the protocol.


5.1 Exploring Vehicle Security 


We next used this setup to send various forged packets
to a car using TPS-A sensors (belonging to one of the
authors) at a rate of 40 packets per second. We made the
following observations.

¹For 433 MHz, the spoofed packet was transmitted at 5.433 GHz.
We have also successfully conducted the experiment using two
RFX1800 daughterboards, whose operational frequencies are from
1.5 GHz to 2.1 GHz.

Figure 12: Block chart of the packet spoofing system.


No authentication. The vehicle ECU ignores packets 
with a sensor ID that does not match one of the known 
IDs of its tires, but appears to accept all other packets. 
For example, we transmitted forged packets with the ID 
of the left front tire and a pressure of 0 PSI and found 0 
PSI immediately reflected on the dashboard tire pressure 
display. By transmitting messages with the alert bit set,
we were able to immediately illuminate the low-pressure
warning light² and, with about 2 seconds delay, the
vehicle’s general-information warning light, as shown in
Figure 14.


No input validation and weak filtering. We forged 
packets at a rate of 40 packets per second. Neither this 
increased rate, nor the occasional different reports by 
the real tire pressure sensor seemed to raise any suspi- 
cion in the ECU or any alert that something was wrong. 
The dashboard simply displayed the spoofed tire pres- 
sure. We next transmitted two packets with very differ- 
ent pressure values alternately at a rate of 40 packets per 
second. The dashboard display appeared to randomly 
alternate between these values. Similarly, when alter- 
nating between packets with and without the alert flag, 
we observed the warning lights switched on and off at 
non-deterministic time intervals. Occasionally, the
display seemed to freeze on one value. These observations
suggest that the TPMS ECU employs trivial filtering
mechanisms which can be easily confused by spoofed packets.

Interestingly, the illumination of the low-pressure 
warning light depends only on the alert bit—the light 
turns on even if the rest of the message reports a nor- 
mal tire pressure of 32 PSI! This further illustrates that 
the ECU does not appear to use any input validation. 


Large range of attacks. We first investigated the 
effectiveness of packet spoofing when vehicles are sta- 
tionary. We measured the attack range when the packet 
spoofing system was angled towards the head of the car, 
and we observed a packet spoofing range of 38 meters. 
For the purpose of proving the concept, we only used 
low-cost antennas and radio devices in our experiments. 
We believe that the range of packet spoofing can be 
greatly expanded by applying amplifiers, high-gain an- 
tennas, or antenna arrays. 


²To discover this bit we had to deflate one tire and observe the tire
pressure sensor’s response. Simply setting a low pressure bit or
reporting low pressure values did not trigger any alert in the vehicle.


Feasibility of Inter-Vehicle Spoofing. We deployed
the attacks against willing participants on highway I-26 
to determine if they are viable at high speeds. Two cars 
owned by the authors were involved in the experiment. 
The victim car had TPS-A sensors installed and the at- 
tacker’s car was equipped with our packet spoofing sys- 
tem. Throughout our experiment, we transmitted alert 
packets using the front-left-tire ID of the target car, while 
the victim car was traveling to the right of the attacker’s 
car. We observed that the attacker was able to trigger 
both the low-pressure warning light and the car’s central- 
warning light on the victim’s car when traveling at 55 
km/h and 110 km/h, respectively. Additionally, the low- 
pressure-warning light illuminated immediately after the 
attacker entered the packet spoofing range. 


5.2 Exploring the Logic of ECU Filtering 


Forging a TPMS packet and transmitting it at a high rate 
of 40 packets per second was useful to validate packet 
spoofing attacks and to gauge the spoofing range. Be- 
yond this, though, it was unclear whether there were fur- 
ther vulnerabilities in the ECU logic. To characterize the 
logic of the ECU filtering mechanisms, we designed a 
variety of spoofing attacks. The key questions to be an- 
swered include: (1) what is the minimum requirement to 
trigger the TPMS warning light once, (2) what is the min- 
imum requirement to keep the TPMS warning light on 
for an extended amount of time, and (3) can we perma- 
nently illuminate any warning light even after stopping 
the spoofing attack? 

So far, we have observed two levels of warning lights:
the TPMS Low-Pressure Warning light (TPMS-LPW) and
the vehicle’s general-information warning light illustrating
‘Check Tire Pressure’. In this section, we explore
the logic of filtering strategies related to the TPMS-LPW
light in detail. The logic controlling the vehicle’s
general-information warning light can be explored in a
similar manner.


5.2.1 Triggering the TPMS-LPW Light 


To understand the minimum requirement of triggering 
the TPMS-LPW light, we started with transmitting one 
spoofed packet with the rear-left-tire ID and eavesdrop- 
ping the entire transmission. We observed that (1) one 
spoofed packet was not sufficient to trigger the TPMS- 
LPW light; and (2) as a response to this packet, the 
TPMS ECU immediately sent two activation signals 
through the antenna mounted close to the rear left tire, 
causing the rear left sensor to transmit eight packets. 
Hence, although a single spoofed packet does not cause
the ECU to display any warning, it does open a
vulnerability to battery drain attacks.

Figure 13: The TPMS trigger tool displays the spoofed packet
with the sensor ID “DEADBEEF”. We crossed out the brand of
TP sensors to avoid legal issues.

Figure 14: Dash panel snapshots: (a) the tire pressure of left
front tire displayed as 0 PSI and the low tire pressure warning
light was illuminated immediately after sending spoofed alert
packets with 0 PSI; (b) the car computer turned on the general
warning light around 2 seconds after keeping sending spoofed
packets.


Next, we gradually increased the number of spoofed 
packets, and we found that transmitting four spoofed 
packets in one second suffices to illuminate the TPMS- 
LPW light. Additionally, we found that those four 
spoofed packets have to be at least 225 ms apart, oth- 
erwise multiple spoofed packets will be counted as one. 
When the interval between two consecutive spoofed 
packets is larger than 4 seconds or so, the TPMS-LPW 
no longer illuminates. This indicates that TPMS adopts 
two detection windows with sizes of 240 ms (a packet 
lasts for 15 ms) and 4 seconds. A 240-ms window is 
considered positive for low tire pressure if at least one 
low-pressure packet has been received in that window 
regardless of the presence of numerous normal packets. 
Four 240-ms windows need to be positive to illuminate 
the TPMS-LPW light. However, the counter for positive 
240-ms windows will be reset if no low-pressure packet 
is received within a 4-s window. 
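
The inferred filtering logic can be written down as a small state machine. The Python sketch below is one possible reading of the observed behavior, not the ECU's actual firmware. It assumes that only low-pressure (alert) packets matter, that a 240 ms window opens at the first alert packet not covered by the previous window, and that the counter resets when more than 4 seconds pass between alert packets. The function name and parameters are hypothetical.

```python
def lpw_light_triggered(alert_times_ms, window_ms=240, needed=4, reset_ms=4000):
    """Return True if the modeled counter reaches `needed` positive windows.

    alert_times_ms holds the start times of low-pressure packets; normal
    packets are ignored, as the observations above suggest.
    """
    count = 0
    last_counted = None   # start of the most recent positive window
    last_alert = None     # time of the most recent alert packet
    for t in sorted(alert_times_ms):
        if last_alert is not None and t - last_alert > reset_ms:
            count = 0                     # gap too long: counter is reset
            last_counted = None
        if last_counted is None or t - last_counted >= window_ms:
            count += 1                    # first alert packet of a new window
            last_counted = t
        last_alert = t
        if count >= needed:
            return True
    return False


print(lpw_light_triggered([0, 240, 480, 720]))       # four separate windows
print(lpw_light_triggered([0, 100, 200, 300]))       # packets merged into two windows
print(lpw_light_triggered([0, 4500, 9000, 13500]))   # counter reset between packets
```

In this model, four alert packets spaced one window apart are enough, packets sent too close together are merged, and packets spaced more than 4 seconds apart never accumulate.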

Although the TPMS ECU does use a counting thresh- 
old and window-based detection strategies, they are de- 
signed to cope with occasionally corrupted packets in a 
benign situation and are unable to deal with malicious 
spoofing. Surprisingly, although the TPMS ECU does 
receive eight normal packets transmitted by sensors as 
a response to its queries, it still concludes the low-tire- 
pressure status based on one forged packet, ignoring the 
majority of normal packets! 


5.2.2 Repeatedly Triggering the TPMS-LPW Light 


The TPMS-LPW light turns off after a few seconds if only
four forged packets are received. To understand how
to sustain the warning light, we repeatedly transmitted
spoofed packets and increased the spoofing period
gradually. The TPMS-LPW light remained illuminated when
we transmitted the low-pressure packet at a rate higher
than one packet per 240 ms, i.e., one packet per detection
window. Spoofing at a rate between one packet per 240
ms and one packet per 4 seconds caused the TPMS-LPW
light to toggle between on and off. However, spoofing at a
rate slower than one packet per 4 seconds could not activate
the TPMS-LPW light, which confirmed our prior experiment
results. Figure 15 depicts the measured TPMS-LPW light
on-durations and off-durations when the spoofing periods
increased from 44 ms to 4 seconds.

As we increased the spoofing period, the TPMS-LPW
light remained on for about 6 seconds on average, but
the TPMS-LPW light stayed off for an increasing
amount of time which was proportional to the spoofing
period. Therefore, it is very likely that the TPMS-ECU
adopts a timer to control the minimum on-duration, and
the off-duration of the TPMS-LPW light can be modeled as
$t_{off} = 3.5x + 4$, where $x$ is the spoofing period. The
off-duration includes the amount of time to observe four
low-pressure forged messages plus the minimum waiting
duration for the TPMS-ECU to remain off, i.e., 4 seconds.
In fact, this confirms our observation that there is
a waiting period of approximately 4 seconds before the
TPMS warning light was first illuminated.
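
The three regimes described in this subsection can be summarized in a small helper. This is a sketch of the empirical model only: the on-duration of 6 seconds is the reported average, the off-duration uses the fitted line $t_{off} = 3.5x + 4$, and the handling of the exact boundary values at 240 ms and 4 seconds is an assumption. The function name is hypothetical.

```python
def lpw_behavior(spoof_period_s):
    """Return (regime, on_duration_s, off_duration_s) for a spoofing period in seconds."""
    if spoof_period_s < 0.240:
        return ("on", None, None)        # at least one alert packet per window
    if spoof_period_s <= 4.0:
        # Light toggles: about 6 s on, then off for roughly 3.5x + 4 seconds.
        return ("toggle", 6.0, 3.5 * spoof_period_s + 4.0)
    return ("off", None, None)           # counter resets before reaching four


for period in (0.1, 1.0, 5.0):
    print(period, lpw_behavior(period))
```

For a spoofing period of 1 second, for example, the model predicts the light staying off for about 7.5 seconds between illuminations.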


5.2.3 Beyond Triggering the TPMS-LPW Light


Our previous spoofing attacks demonstrated that we can 
produce false TPMS-LPW warnings. In fact, transmit- 
ting forged packets at a rate higher than one packet per 
second also triggered the vehicle’s general-information 
warning light illustrating ‘Check Tire Pressure’.
Depending on the spoofing period, the gap between the
illumination of the TPMS-LPW light and the vehicle’s
general-information warning light varied from a few
seconds to 130 seconds, and the TPMS-LPW light
remained illuminated afterwards.

Throughout our experiments, we typically exposed the 
car to spoofed packets for a duration of several minutes at 
a time. While the TPMS-LPW light usually disappeared 
about 6 seconds after stopping spoofed message trans- 
missions, we were once unable to reset the light even by 
turning off and restarting the ignition. It did, however, 
reset after about 10 minutes of driving. 

Figure 15: TPMS low-pressure warning light on and off
duration vs. spoofing periods.

To our surprise, at the end of only two days of
sporadic experiments involving triggering the TPMS
warning on and off, we managed to crash the TPMS ECU and
completely disabled the service. The vehicle’s
general-information warning light illustrating ‘Check
TPMS System’ was activated and no tire pressure
information was displayed on the dashboard, as shown in
Figure 16. We
attempted to reset the system by sending good packets, 
restarting the car, driving on the highway for hours, and 
unplugging the car battery. None of these endeavors 
were successful. Eventually, a visit to a dealership recov- 
ered the system at the cost of replacing the TPMS ECU. 
This incident suggests that it may be feasible to crash the
entire TPMS, and that the damage from such an attack can be
so severe that the owner has no option but to seek the
services of a dealership. We note that one can easily
explore the logic of a vehicle’s general-information
warning light using methods similar to those we used for
the TPMS-LPW light. We did not pursue further analysis due
to the prohibitive cost of repairing the TPMS ECU.


5.3 Lessons Learned


The successful implementation of a series of spoofing at- 
tacks revealed that the ECU relies on sensor IDs to filter 
packets, and the implemented filter mechanisms are not 
effective in rejecting packets with conflicting
information or abnormal packets transmitted at extremely
high rates. In fact, the current filter mechanisms
introduce security risks. For instance, the TPMS ECU will
trigger
the sensors to transmit several packets after receiving one 
spoofed message. Those packets, however, are not lever- 
aged to detect conflicts and instead can be exploited to 
launch battery drain attacks. In summary, the absence of 
authentication mechanisms and weak filter mechanisms 
open many loopholes for adversaries to explore for more 
‘creative’ attacks. Furthermore, despite the unavailabil- 
ity of a radio frontend that can transmit at 315/433 MHz, 
we managed to launch the spoofing attack using a fre- 
quency mixer. This result is both encouraging and alarm- 
ing since it shows that an adversary can spoof packets 
even without easy access to transceivers that operate at 
the target frequency band.

Figure 16: Dash panel snapshots indicating the TPMS system
error (this error cannot be reset without the help of a
dealership): (a) the vehicle’s general-information warning
light; (b) tire pressure readings are no longer displayed
as a result of system function errors.


6 Protecting TPMS Systems from Attacks 


There are several steps that can improve the TPMS de- 
pendability and security. Some of the problems arise 
from poor system design, while other issues are tied to 
the lack of cryptographic mechanisms. 


6.1 Reliable Software Design 


The first recommendation that we make is that software
running on TPMS should follow basic reliable software
design practices. In particular, we have observed that it
was possible to convince the TPMS control unit to
display readings that were clearly impossible. For
example, the TPMS packet format includes a field for tire
pressure as well as a separate field for warning flags
related to tire pressure. Unfortunately, the relationship
between these fields was not checked by the TPMS ECU when
processing communications from the sensors. As noted
earlier, we were able to send a packet containing a
legitimate tire pressure value while also containing a low
tire pressure warning flag. The result was that the
driver’s display indicated that the tire had low pressure
even though its pressure was normal. A straightforward fix
for this problem (and other similar problems) would be to
update the software on the TPMS control unit to perform
consistency checks between the values in the data fields
and the warning flags. Similarly, during our message
spoofing attacks, the control unit did query the sensors
to confirm the low pressure, but it ignored the legitimate
packet responses completely. The control unit could have
employed some detection mechanism to, at least, raise an
alarm when detecting frequent conflicting information, or
have enforced some majority logic to filter out suspicious
transmissions.
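
The two checks recommended above, a per-packet consistency check and a majority decision over a detection window, might look roughly like the following. This is a minimal sketch under stated assumptions, not the ECU's actual logic: the packet field names, the threshold value, and both function names are hypothetical.

```javascript
// Hypothetical low-pressure threshold; the real ECU value is not given in the text.
const LOW_PRESSURE_KPA = 170;

// A packet is self-consistent if its low-pressure flag agrees with its
// pressure field.
function isConsistent(packet) {
  const pressureIsLow = packet.pressureKpa < LOW_PRESSURE_KPA;
  return packet.lowPressureFlag === pressureIsLow;
}

// Majority vote over the self-consistent packets seen in one window.
// Returns "low", "normal", or "conflict" (a tie, or no usable packets).
function decideTireStatus(packets) {
  const usable = packets.filter(isConsistent);
  const low = usable.filter((p) => p.lowPressureFlag).length;
  const normal = usable.length - low;
  if (low > normal) return "low";
  if (normal > low) return "normal";
  return "conflict";
}
```

With this logic, one forged low-pressure packet followed by eight normal sensor responses would be outvoted, and a packet carrying a normal pressure value together with a low-pressure flag would be discarded before the vote.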


6.2 Improving Data Packet Format 


One fundamental reason that eavesdropping and spoof- 
ing attacks are feasible in TPMS systems is that packets 
are transmitted in plaintext. To prevent these attacks, a
first line of defense is to encrypt TPMS packets³. The
basic packet format in a TPMS system includes a sensor ID
field, fields for temperature and tire pressure, fields for
various warning flags, and a checksum. Unfortunately,
the current packet format used is ill-suited for proper en- 
cryption, since naively encrypting the current packet for- 
mat would still support dictionary-based cryptanalysis as 
well as replay attacks against the system. For this reason, 
we recommend that an additional sequence number field 
be added to the packet to ensure freshness of a packet. 
Further, requiring that the sequence number field be in- 
cremented during each transmission would ensure that 
subsequent encrypted packets from the same source be- 
come indistinguishable, thereby making eavesdropping 
and cryptanalysis significantly harder. We also recom- 
mend that an additional cryptographic checksum (e.g. a 
message authentication code) be placed prior to the CRC 
checksum to prevent message forgery. 
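
To make the role of the sequence number and the cryptographic checksum concrete, here is a minimal sketch using Node.js's built-in crypto module. It covers only freshness and authentication; encryption and the CRC are omitted. The field layout, the 4-byte truncated tag, and all function names are assumptions for illustration, not part of any TPMS specification.

```javascript
const nodeCrypto = require("crypto");

const FIELD_BYTES = 10; // hypothetical layout: id(4) + seq(4) + pressure(1) + temp(1)
const MAC_BYTES = 4;    // hypothetical truncated tag length

function encodeFields({ sensorId, seq, pressure, temperature }) {
  const buf = Buffer.alloc(FIELD_BYTES);
  buf.writeUInt32BE(sensorId, 0);
  buf.writeUInt32BE(seq, 4);
  buf.writeUInt8(pressure, 8);
  buf.writeUInt8(temperature, 9);
  return buf;
}

// Message authentication code over all fields, truncated to MAC_BYTES.
function computeTag(key, fields) {
  return nodeCrypto.createHmac("sha256", key).update(fields).digest().subarray(0, MAC_BYTES);
}

function makePacket(key, reading) {
  const fields = encodeFields(reading);
  return Buffer.concat([fields, computeTag(key, fields)]);
}

// Receiver side: `state` maps sensor ID to the last accepted sequence number.
function acceptPacket(key, state, packet) {
  if (packet.length !== FIELD_BYTES + MAC_BYTES) return false;
  const fields = packet.subarray(0, FIELD_BYTES);
  const tag = packet.subarray(FIELD_BYTES);
  if (!nodeCrypto.timingSafeEqual(tag, computeTag(key, fields))) return false; // forged
  const sensorId = fields.readUInt32BE(0);
  const seq = fields.readUInt32BE(4);
  const last = state.get(sensorId);
  if (last !== undefined && seq <= last) return false; // replayed or stale
  state.set(sensorId, seq);
  return true;
}
```

A receiver that tracks the last accepted sequence number rejects a replayed packet even though its tag is valid, and rejects a modified packet because the tag no longer matches.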

Such a change in the payload would require that
TPMS sensors have a small amount of memory in order to
store cryptographic keys, as well as the ability to perform
encryption. An obvious concern is the selection of cryp- 
tographic algorithms that are sufficiently light-weight to 
be implemented on the simple processor within a TPMS 
sensor, yet also resistant to cryptanalysis. A secondary 
concern is the installation of cryptographic keys. We
envision that the sensors within a tire would have keys
pre-installed, and that the corresponding keys could be
entered into the ECU at the factory, dealership, or a
certified garage. Although it is unlikely that encryption
and authentication keys would need to be changed, it would
be a simple matter to piggy-back a rekeying command
on the 125 kHz activation signal in a manner that only
certified entities could update keys.


³ We note that encrypting the entire message (or at least
all fields that are not constant across different cars) is
essential as otherwise the ability to read these fields
would support a privacy breach.


6.3 Preventing Spoofed Activation


The spoofing of an activation signal forces sensors to
emit packets and facilitates tracking and battery drain
attacks. Although activation signals are very simple, they
can still convey a small number of bits. Using a long
packet format with encryption and authentication is
unsuitable, so we suggest instead that the few bits they
can convey be used as a sequencing field, where the
sequencing follows a one-way function chain in a manner
analogous to one-time signatures. Thus, the ECU would be
responsible for maintaining the one-way function chain,
and the TPMS sensor would simply hash the observed
sequence number and compare the result with the previous
sequence number. This would provide a simple means of
filtering out false activation signals. We note that other
legitimate sources of activation signals are specialized
entities, such as dealers and garages, and such entities
could access an ECU to acquire the position within the
hash chain in order to reset their activation units
appropriately to allow them to send valid activation
signals.
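
The one-way function chain described in this section can be sketched as follows. This is an illustration under assumptions rather than a proposed implementation: SHA-256 truncated to 4 bytes stands in for whatever lightweight one-way function a sensor could afford, a real activation signal may carry fewer bits than that, and the function names are ours.

```javascript
const nodeCrypto = require("crypto");

const VALUE_BYTES = 4; // hypothetical width of the sequencing field

// One step of the one-way function: truncated SHA-256 (an assumption).
function step(value) {
  return nodeCrypto.createHash("sha256").update(value).digest().subarray(0, VALUE_BYTES);
}

// ECU side: chain[0] is a secret seed and chain[i + 1] = step(chain[i]).
// The ECU releases values in reverse order: chain[n - 1], chain[n - 2], ...
function buildChain(seed, length) {
  const chain = [seed];
  for (let i = 0; i < length; i++) chain.push(step(chain[i]));
  return chain;
}

// Sensor side: remembers the last accepted value (initially the chain's end)
// and accepts a new value only if hashing it yields the remembered one.
function makeSensor(anchor) {
  let last = anchor;
  return {
    accept(value) {
      if (!step(value).equals(last)) return false;
      last = value;
      return true;
    },
  };
}
```

Because each released value hashes to its predecessor, an eavesdropper who records one activation signal cannot derive the next one, and replaying the recorded value fails once the sensor has advanced.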


7 Related Work 


Wireless devices have become an inseparable part of our 
social fabric. As such, much effort has been dedicated
to analyzing their privacy and security issues. Devices
being studied include RFID systems [27, 30, 41], mass- 
market UbiComp devices [38], household robots [14], 
and implantable medical devices [21]. Although our 
work falls in the same category and complements those 
works, TPMS in automobiles exhibits distinctive features 
with regard to the radio propagation environment (strong 
reflection within and off metal car bodies), ease of access 
by adversaries (cars are left unattended in public), span 
of usage, a tight linkage to the owners, etc. All these 
characteristics have motivated this in-depth study on the 
security and privacy of TPMS. 

One related area of research is location privacy in 
wireless networks, which has attracted much attention 
since wireless devices are known to present tracking 
risks through explicit identifiers in protocols or identi- 
fiable patterns in waveforms. In the area of WLAN, 
Brik et al. have shown the possibility to identify users 
by monitoring radiometric signatures [10]. Gruteser et
al. [19] demonstrated that one can identify a user’s
location through link- and application-layer information.
A common countermeasure against location privacy breaches
is to change user identities frequently. For instance,
Jiang et al. [24] proposed a pseudonym scheme where 
users change MAC addresses each session. Similarly, 
Greenstein et al. [18] have suggested an identifier-free 
mechanism to protect user identities, whereby users can 
change addresses for each packet. 

In cellular systems, Lee et al. have shown that the lo- 
cation information of roaming users can be released to 
third parties [28], and proposed using the temporary mo- 
bile subscriber identifier to cope with the location privacy 
concern. IPv6 also has privacy concerns caused by the 
fixed portion of the address [32], and thus the use of peri- 
odically varying pseudo-random addresses has been rec- 
ommended. The use of pseudonyms is not sufficient to 
prevent automobile tracking since the sensors report tire 
pressure and temperature readings, which can be used 
to build a signature of the car. Furthermore, pseudonyms 
cannot defend against packet spoofing attacks such as we 
have examined in this paper. 

Security and privacy in wireless sensor networks have 
been studied extensively. Perrig et al. [37] have proposed 
a suite of security protocols to provide data
confidentiality and authentication for resource-constrained
sensors.
Random key predistribution schemes [12] have been pro- 
posed to establish pairwise keys between sensors on de- 
mand. Those key management schemes cannot work 
well with TPMS, since sensor networks are concerned 
with establishing keys among a large number of sensors 
while the TPMS focuses on establishing keys between 
four sensors and the ECU only. 

Lastly, we note related work on the security of a car’s 
computer system [26]. Their work involved analyzing 
the computer security within a car by directly mounting 
a malicious component into a car’s internal network via 
the On-Board Diagnostics (OBD) port (typically under
the dashboard), and differs from our work in that we
were able to remotely affect an automobile’s security at 
distances of 40 meters without entering the car at all. 


8 Concluding Remarks 


Tire Pressure Monitoring Systems (TPMS) are the first 
in-car wireless network to be integrated into all new cars 
in the US and will soon be deployed in the EU. This pa- 
per has evaluated the privacy and security implications 
of TPMS by experimentally evaluating two representa- 
tive tire pressure monitoring systems. Our study revealed 
several security and privacy concerns. First, we reverse 
engineered the protocols using the GNU Radio in con- 
junction with the Universal Software Radio Peripheral 
(USRP) and found that: (i) the TPMS does not employ
any cryptographic mechanisms and (ii) transmits a fixed
sensor ID in each packet, which raises the possibility of 
tracking vehicles through these identifiers. Sensor trans- 
missions can be triggered from roadside stations through 
an activation signal. We further found that neither the 
heavy shielding from the metallic car body nor the low- 
power transmission has reduced the range of eavesdrop- 
ping sufficiently to reduce eavesdropping concerns. In 
fact, TPMS packets can be intercepted up to 40 meters 
from a passing car using the GNU Radio platform with a 
low-cost, low-noise amplifier. We note that the eaves- 
dropping range could be further increased with direc- 
tional antennas, for example. 

We also found out that current implementations do 
not appear to follow basic security practices. Messages 
are not authenticated and the vehicle ECU also does not 
appear to use input validation. We were able to inject 
spoofed messages and illuminate the low tire pressure 
warning lights on a car traveling at highway speeds from 
another nearby car, and managed to disable the TPMS 
ECU by leveraging packet spoofing to repeatedly turn on 
and off warning lights. 

Finally, we have recommended security mechanisms 
that can alleviate the security and privacy concerns
presented without unduly complicating the installation of
new tires. The recommendations include standard
reliable software design practices and basic cryptographic
recommendations. We believe that our analysis and rec- 
ommendations on TPMS can provide guidance towards 
designing more secure in-car wireless networks.


Abstract


The browser has become the de facto platform for ev- 
eryday computation. Among the many potential attacks 
that target or exploit browsers, vulnerabilities in browser 
extensions have received relatively little attention. Cur- 
rently, extensions are vetted by manual inspection, which 
does not scale well and is subject to human error. 

In this paper, we present VEX, a framework for high- 
lighting potential security vulnerabilities in browser ex- 
tensions by applying static information-flow analysis to 
the JavaScript code used to implement extensions. We 
describe several patterns of flows as well as unsafe pro- 
gramming practices that may lead to privilege escala- 
tions in Firefox extensions. VEX analyzes Firefox ex- 
tensions for such flow patterns using high-precision, 
context-sensitive, flow-sensitive static analysis. We an- 
alyze thousands of browser extensions, and VEX finds 
six exploitable vulnerabilities, three of which were previ- 
ously unknown. VEX also finds hundreds of examples of 
bad programming practices that may lead to security vul- 
nerabilities. We show that compared to current Mozilla 
extension review tools, VEX greatly reduces the human 
burden for manually vetting extensions when looking for 
key types of dangerous flows. 


1 Introduction 


Driving the Internet revolution is the modern web 
browser, which has evolved from a relatively simple 
client application designed to display static data into a 
complex networked operating system tasked with man- 
aging many facets of a user’s on-line experience. To 
help meet the varied needs of a broad user population, 
browser extensions expand the functionality of browsers 
by interposing on and interacting with browser-level 
events and data. Some extensions are simple and make 
only small changes to the appearance of web pages or the 
browser itself. Other extensions provide more
sophisticated functionality, such as NOSCRIPT, which
provides fine-grained control over JavaScript execution
[20], or GREASEMONKEY, which provides a full-blown
programming environment for scripting browser behavior
[6]. These are just a few of the thousands of extensions
currently available for Firefox, the second most popular
browser today¹.

Extensions written with benign intent can have subtle 
vulnerabilities that expose the user to a disastrous attack 
from the web, often just by viewing a web page. Fire- 
fox extensions run with full browser privileges, so at- 
tackers can potentially exploit extension weaknesses to 
take over the browser, steal cookies or protected pass- 
words, compromise confidential information, or even hi- 
jack the host system, without revealing their actions to 
the user. Unfortunately, tens of extension vulnerabili- 
ties have been discovered in the last few years, and capa- 
ble attacks against buggy extensions have already been 
demonstrated [23]. 

To help reduce the attack surface for extensions, 
Mozilla provides a set of security primitives to
extension developers. However, these security primitives
are discretionary, and can be difficult to understand
and use correctly. For example, Firefox provides
an evalInSandbox(text, sandbox) function
that returns the result of evaluating the text string
under the restricted privileges associated with the
environment sandbox. Using evalInSandbox correctly
requires developers to test the result of a call to
evalInSandbox with the non-traditional “===” rather
than “==”, as the “==” operation may invoke unsafe code
as a side effect (see
http://developer.mozilla.org/En/Components.utils.evalInSandbox
for details).

¹ Firefox now surpasses Internet Explorer in W3schools
traffic (www.w3schools.com/browsers/browsers_stats.asp),
arguably due to the popularity of Firefox extensions.

Current approaches from the research community
propose dynamic techniques for improving the security of
extensions. The SABRE system tracks tainted JavaScript
objects to prevent extensions from accessing sensitive
information unsafely [9]. Although SABRE can prevent
potentially malicious flows from both exploited extensions
and from malicious extensions, SABRE adds overhead to 
all JavaScript execution within the browser, adding 6.1x 
overhead for the SunSpider benchmark and 2.36x over- 
head for the V8 JavaScript benchmark. Furthermore, 
SABRE’s dynamic nature pushes security violation no- 
tification to users who are unable to determine if a par- 
ticular flow is malicious or benign. The Google Chrome 
Extension System revisits the overall extension API to 
make it easier for the browser to enforce least privilege 
and strong isolation on extensions [4]. Their system 
works by partitioning the full set of extension function- 
ality into different protection domains, and sand-boxing 
extensions to prevent them from obtaining more privi- 
leges than needed. Although this system is likely to limit 
the damage from some extension attacks, it does little to 
prevent the vulnerabilities themselves. 


In this paper, we propose VEX, a system for find- 
ing vulnerabilities in browser extensions using static 
information-flow analysis. Many vulnerabilities trans- 
late to certain types of explicit information flows from 
injectable sources to executable sinks. For extensions 
written with benign intent, most attacks involve the at- 
tacker injecting JavaScript into a data item that is sub- 
sequently executed by the extension under full browser 
privileges. We identify key flows of this nature that can 
lead to security vulnerabilities, and we analyze for these 
flows statically using high-precision static analysis that 
is both path-sensitive and context-sensitive, to minimize 
the number of false positive suspect flows. VEX uses 
precise summaries to analyze code, and has special fea- 
tures to handle the quirks of JavaScript (e.g., VEX does 
a constant string analysis for expressions that flow into 
the eval statement). Because VEX uses static analysis, 
we avoid the runtime overhead induced by dynamic ap- 
proaches. 


Determining whether extensions are malicious or har- 
bor security vulnerabilities is a hard problem. Exten- 
sions are typically complex artifacts that interact with 
the browser in subtle and hard to understand ways. For 
example, the ADBLOCK PLUS extension performs the 
seemingly simple task of filtering out ads based on a 
list of ad servers. However, the ADBLOCK PLUS im- 
plementation consists of over 11K lines of JavaScript 
code. Similarly, the NOSCRIPT extension provides fine- 
grained control over which domains are allowed to ex- 
ecute JavaScript and basic cross-site scripting protec- 
tion. The NOSCRIPT extension implementation consists 
of over 19K lines of JavaScript code. Also, ADBLOCK
PLUS had 30 releases between 1/1/06 and 11/20/09, and
NOSCRIPT had 38 releases between 1/1/09 and 11/20/09
alone. While Mozilla uses volunteers to vet each new
extension and revision before posting it on their
official list of approved
Firefox extensions, examining an extension to find a vul- 
nerability requires a detailed understanding of the code 
to reason about anything beyond the most basic type of 
information flow. Thus tools to help vet browser exten- 
sions can be very useful for improving the security of 
extensions. 

We show that VEX can catch several known vulnera- 
bilities, such as a vulnerability in the FIZZLE extension 
[8], and also find new problems, including exploitable 
vulnerabilities in BEATNIK and WIKIPEDIA TOOLBAR. 
In particular, VEX reported a previously unknown vul- 
nerability in WIKIPEDIA TOOLBAR that could lead to an 
attack, and that resulted in the report CVE-2009-4127. 
We reported this vulnerability to the WIKIPEDIA TOOL- 
BAR developers, who fixed the extension. We also show 
that VEX can help to find the use of unsafe programming 
practices, such as misuse of evalInSandbox, that can 
result from subtle information flows. 

The remainder of the paper is organized as follows. 
Section 2 describes the threat model and the assumptions 
under which we analyze the browser extensions. Sec- 
tion 3 provides background material on the architecture 
of Firefox and the nature of certain key undesirable in- 
formation flows in its extensions. Section 4 describes our 
static analysis and the various design choices we made to 
build VEX. Section 5 lists and describes our results.
Section 6 surveys related work, and Section 7 concludes
the paper.


2 Threat model, assumptions, and usage model


In this paper, we focus on finding security vulnerabili- 
ties in buggy browser extensions. We do not try to iden- 
tify malicious extensions, bugs in the browser itself, or 
bugs in other browser extensibility mechanisms, such as 
plug-ins. We assume that the developer is neither mali- 
cious nor trying to obfuscate extension functionality, but 
we assume the developer could write incorrect code that 
contains vulnerabilities. 

We use two attack models. First, we consider attacks 
that originate from web sites, and we assume the attacker 
can send arbitrary HTML and JavaScript to the user’s 
browser. We focus on attacks where this untrusted data 
can lead to code injection or privilege escalation through 
buggy extensions. In the second attack model, we con- 
sider some web sites as trusted. For example, if an exten- 
sion gleans information from Facebook, we assume that 
the Facebook code will not include arbitrary HTML and 
JavaScript, but only well formatted and trusted data. 

According to the Mozilla developer site, Mozilla has
a team of volunteers who help vet extensions manually.
They run new and updated extensions isolated in a
virtual machine to test the user experience. The editors
also use a validation tool, which uses grep to look for key
indicators of bugs. Many of the patterns they search for
involve interactions between extensions and web pages,
and they use their understanding of these patterns to help
guide their inspection of the code. Our goal is to help
automate this process, so that analysts can quickly hone
in on particular snippets of code that are likely to contain
security vulnerabilities. Figure 1 shows our overall work
flow for using VEX.

Figure 1: The overall analysis process of VEX. (1) Download
extensions and uncompress them. (2) Feed them to the VEX
analyzer. (3) VEX analyzes JavaScript for flow and unsafe
programming patterns. (4) VEX outputs extensions that have
flows. (5) An extension vetter manually analyzes the flows
for vulnerabilities, classifying each extension as safe or
attackable.


3 Background 


3.1 Mozilla privilege levels 


Firefox has two privilege levels: page, for the web page 
displayed in the browser’s content pane; and chrome, for 
elements belonging to Firefox and its extensions, i.e.,
everything surrounding the content pane. Page privileges
are more restrictive than chrome privileges. For exam- 
ple, a page loaded from site x cannot access content from 
sites other than x. General Firefox code runs with full 
chrome privileges, which give it access to all browser 
states and events, OS resources like the file system and 
network, and all web pages. Firefox provides the ex- 
tensions with full chrome privileges by exposing a spe- 
cial API called the XPCOM Components to extension 
JavaScript, thereby allowing the extensions to have ac- 
cess to all the resources Firefox can access. 

Extensions can often access objects that run with page 
privileges and interact with page content, as well as ob- 
jects that run with full chrome privileges. Extensions can 
also include user interface components via a chrome
document, which also runs with full chrome privileges. For
example, the object window refers to the chrome
window and the object window.content refers to the
content window. To access the document object referring
to the content (i.e., the user page), the extension has to
access the document property of the content window,
i.e., window.content.document.

To make this extension architecture practical, Firefox 
has APIs for extension code to communicate across pro- 
tection domains. These interactions are one cause of ex- 
tension security vulnerabilities. As the Mozilla devel- 
oper site explains, “One of the most common security is- 
sues with extensions is execution of remote code in privi- 
leged context. A typical example is an RSS reader exten- 
sion that would take the content of the RSS feed (HTML 
code), format it nicely and insert into the extension win- 
dow. The issue that is commonly overlooked here is that 
the RSS feed could contain some malicious JavaScript 
code and it would then execute with the privileges of the 
extension — meaning that it would get full access to the 
browser (cookies, history etc) and to user’s files” [sic]. 


3.2 Points of attack 


Here we discuss key vulnerable points for code injection 
and privilege escalation attacks against non-malicious 
extensions: eval, evalInSandbox, innerHTML, and 
wrappedJSObject. We focus on these Firefox features 
because they are key points of interaction between ob- 
jects with page and chrome privileges, respectively, and 
this interaction is a key source of security vulnerabilities, 
as noted above. Though other avenues of attack are
possible, we do not consider them here.


eval: The eval function call interprets string data as
JavaScript, which it executes dynamically. This flexible 
mechanism can be used to generate JavaScript code
dynamically, for example to deserialize JSON objects.
However, this flexibility can lead to code injection
vulnerabilities in extensions. If extensions execute eval
functions on un-sanitized strings that come from untrusted
web pages, the attacker will be able to inject JavaScript 
code that will run with full chrome privileges. 
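
The difference between evaluating a string and parsing it as data can be seen in a few lines of plain JavaScript. The sketch below runs outside any browser, so nothing in it executes with chrome privileges; the hostile string and both function names are made up for illustration.

```javascript
// Untrusted text posing as JSON: the payload runs when the text is evaluated.
const hostile = '({ "title": (globalThis.pwned = true, "x") })';

// Unsafe: eval executes whatever expression the string contains.
function parseWithEval(text) {
  return eval(text);
}

// Safer: JSON.parse accepts only JSON data and throws on anything else.
function parseAsData(text) {
  try {
    return JSON.parse(text);
  } catch (err) {
    return null;
  }
}
```

Passing the hostile string to parseAsData yields null and leaves the global state untouched, whereas parseWithEval returns an object and, as a side effect, sets globalThis.pwned.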


innerHTML: Each HTML element for a page has an
innerHTML property that defines the text that occurs be- 
tween that element’s opening and closing tag. Exten- 
sions can change the innerHTML property to alter ex- 
isting document object model (DOM) elements, or to 
add new DOM elements. When an extension modifies 
the innerHTML property, the browser re-parses and pro- 
cesses the new data. Thus, passing specially crafted un- 
sanitized strings (e.g., <img> tags with script in their 
onload attribute) into innerHTML modifications can 
lead to code injection attacks. 
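
One common way to sanitize a string before it reaches innerHTML is to escape the characters that HTML treats as markup. The helper below is a generic sketch of that idea, not something taken from Firefox or from the extensions studied here, and its name is hypothetical; assigning untrusted text to an element's textContent instead of innerHTML is another way to avoid re-parsing it as markup.

```javascript
// Escape the five characters that are significant in HTML markup so that
// untrusted text is rendered as text rather than parsed as elements.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```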


evalInSandbox: One way Firefox facilitates communication
across protection domains is through the evalInSandbox
method. This method enables extensions to execute
JavaScript in the extension’s context with restricted
privileges, thus enabling extensions to process untrusted
data from web pages safely. The sandbox object is an empty
JavaScript object created with restricted privileges. For
example, the call s = Sandbox("http://www.w3.org/")
creates a sandbox s where code can execute with page
privileges, as though it came from the domain www.w3.org.
One can add properties to this object by calling the
evalInSandbox function, and any attempts to access global
scope objects from within evalInSandbox, including
privileged chrome objects, are denied. evalInSandbox
complicates extension programming because objects returned
from the method call execute in the extension with full
chrome privileges. Since methods associated with the
object could have been modified within the sandbox, they
should not be called in the chrome context. For example,
“==” should not be used on these objects as its evaluation
calls the toString or valueOf method, which could
have been modified; instead the non-traditional “===”
operator needs to be used.
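
The coercion behavior behind this advice is ordinary JavaScript and can be observed without Firefox. In the sketch below, a plain object with a replaced valueOf method stands in for an object returned from a sandbox; the object and variable names are illustrative only.

```javascript
// Stand-in for an object returned from a sandbox: code that ran inside the
// sandbox has replaced its valueOf method.
const calls = [];
const returned = {
  valueOf() {
    calls.push("valueOf ran");
    return 42;
  },
};

const looseResult = returned == 42;   // coerces the object: calls returned.valueOf()
const callsAfterLoose = calls.length;
const strictResult = returned === 42; // no coercion: no method on the object is called
const callsAfterStrict = calls.length;
```

The loose comparison runs the attacker-controlled method in the caller's context, which is exactly what must be avoided when the caller has chrome privileges. The strict comparison never invokes it.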


wrappedJSObject: JavaScript objects can be dynam- 
ically modified. That means that any web page can 
modify the properties of the document object. For ex- 
ample, a web page can reassign the getElementById 
method to return a malicious script. To prevent this 
script from being executed by the extension when
it calls window.content.document.getElementById,
Firefox automatically wraps the object so that the 
window.content.document accesses only use the orig- 
inal document object, not the modified one. However, 
Firefox also provides the wrappedJSObject method, 
which lets the extension access the modified version, 
even when automatic wrapping is turned on; calling 
wrappedJSObject on a content document is potentially 
dangerous. 


3.3 Suspicious flow patterns


In this section we discuss five source-to-sink
flows that might be vulnerable. Specifically, we track
flows from Resource Description Framework (RDF)
data (e.g., bookmarks) to innerHTML, content document
data to eval, content document data to innerHTML,
evalInSandbox return objects used improperly by code
running with chrome privileges, and wrappedJSObject
return objects used improperly by code running with
chrome privileges. These flows do not always result in
a vulnerability, and they are by no means an exhaustive 
list of all possible extension security bugs, but they are 
the patterns we use in our tool. 

RDF is a model for describing hierarchical relation- 
ships between browser resources [33]. Extension de- 
velopers can store persistent extension data in an RDF 
file, or access browser resources, such as bookmarks, 
stored in RDF format. RDF data can come from un- 
trusted sources. For example, when a user stores a book- 
mark, Firefox records the un-sanitized title of the book- 
marked page in the RDF file. Extensions that use RDF 
data need to sanitize it properly if they use it directly in 
an innerHTML statement that modifies an element in a 
chrome document. 
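
As an illustration of the kind of sanitization meant here, the following is a minimal sketch of HTML escaping applied to a bookmark title before it is concatenated into markup. The function name escapeHtml and the sample title are assumptions made for this example; the paper does not prescribe a particular sanitizer, and assigning text through a DOM property such as textContent avoids the problem without escaping.

```javascript
// Hypothetical helper: escape the characters that let text break out of
// an HTML text or attribute context. The ampersand is replaced first so
// that the entities produced by later replacements are not escaped again.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical un-sanitized page title as it might be stored in RDF.
const storedTitle = '<img src=x onerror="alert(1)">';
const safeTitle = escapeHtml(storedTitle);

console.log(safeTitle);
// prints: &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```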

Content document data flowing to eval or innerHTML
can sometimes be exploited. This flow can result in script
execution with chrome privileges if specially crafted
content from the window.content.document object
is passed to eval or to the innerHTML of an element in the
chrome document.

For evalInSandbox and wrappedJSObject, prob- 
lems can only result if the return values of these 
constructs are executed with chrome privileges. For 
evalInSandbox this means comparing return values us- 
ing == or != from code running with chrome privileges. 
For wrappedJSObject, this means making method calls 
on returned objects from code running with chrome priv- 
ileges. 

Such flow patterns may occur in only a few 
of the extensions that use these constructs. Ac- 
cording to the Mozilla extension review web page, 
reviewers have an open-source automatic tool to 
help with reviews (see
https://addons.mozilla.org/en-US/firefox/pages/validation), but this tool just
greps for strings that indicate dangerous patterns. Af- 
terward, the reviewer must go through the code of each 
suspect extension to understand the flows and determine 
which constitute vulnerabilities and which are benign. 
As this task is difficult, painful, and error-prone, we de- 
signed the VEX tool to help extension reviewers vet the 
flows in extensions automatically, greatly reducing the 
number of extensions that need manual review. 


4 Static information flow analysis 


We develop a general explicit information flow static 
analysis tool VEX for JavaScript that computes flows be- 
tween any source and sink, including the flows described 
in Section 3.3. While we could develop analysis tech- 
niques for a particular source and sink, we prefer a more 
general technique that will perform the analysis once, 
and from the results, allow us to search for any source- 
to-sink flow. This allows VEX to be run in a single pass 
over thousands of extensions, rather than using separate 
passes for each target pattern. 

To support fine-grained information-flow analysis, 
VEX tracks the precise dependencies of flows from variables
to objects created in the JavaScript extension, using
a taint-based analysis. Motivated by the fact that every 
flow reported needs to be checked manually for attacks, 
which can take considerable human effort, we aim for 
an analysis that admits as few false positives as possi- 
ble (false positives are non-existent flows reported by the 
tool). 

Statically analyzing JavaScript extensions for flows is 
a non-trivial task. JavaScript extensions have a large 
number of objects and functions. In addition to the ob- 
jects defined in the program, the extensions can also ac- 
cess the browser’s DOM API and the Firefox Extension 
API provided by XPCOM components. The objects are 
also dynamic, in the sense that new object properties can 
be created dynamically at run-time. Functions are ob- 
jects in JavaScript, and hence can be created, redefined 
dynamically, and passed as parameters. The challenge is 
to accurately keep track of such objects, properties, and 
the corresponding flows to them. 

Our analysis keeps track of an abstract heap (AH) that 
is not a priori bounded, and keeps track of the precise 
heap nodes and field relations and corresponding flows, 
but ignores the exact primitive values in the heap (like 
integers). However, we bound the number of iterations 
in computing the least fixed-point, and hence the abstract 
heap gets bounded implicitly. 

The abstract heap transformations for any statement 
closely mimic a big-step operational semantics for 
JavaScript, except that primitive values are forgotten, and
hence conditionals are not evaluated; we refer the reader
to work on operational semantics of JavaScript [27, 18]. 

Apart from tracking heap structures, the abstract heap 
also records explicit-flow dependencies to heap nodes, 
and the rules for updating flows naturally depend on the 
program’s semantics. Also, as we discuss in more
detail below, there are some aspects of the heap (such
as prototype fields) that are not currently supported in 
our tool. The static analysis itself is flow-sensitive and 
context-sensitive, and the context-sensitivity is handled 
using classical function-summary based methods. 

The above choices, namely abstract heaps and a
context-sensitive, flow-sensitive analysis, are design
choices based on more than a year of experiments with
extensions, and were motivated by the goal of reducing
false positives. We have not tried all variants of these
choices, and it is possible that other choices (for example,
bounding abstract heaps by merging objects created at
a program site) may also work well on extensions. However,
we do know that context-sensitivity is important (in several
extensions we manually examined), and flow-sensitivity
seems important if the tool is extended to consider
sanitization routines as flow-stoppers.

The rest of this section is structured as follows. First 
we explain our analysis using abstract heaps for a core 
subset of JavaScript, which does not have statements like 
eval, associative array accesses, calls to Firefox APIs, 
etc. Subsequently, we describe how we handle the as- 
pects not covered in the core. 


4.1 Analysis of a core subset of JavaScript 


Core JavaScript: A core subset of JavaScript is given 
in Figure 2; this core reflects the aspects of JavaScript de- 
scribed above, but omits certain features (such as eval) 
which we will describe later. 


Abstract Heaps: Our analysis keeps track of one abstract
heap at each program point. This abstract heap
tracks JavaScript objects and functions and the relationships
between them in the form of a graph. Each node
in the graph is a heap location generated by the program.
Two different nodes, $n_1$ and $n_2$, are connected by an edge
labeled $f$ if node $n_1$'s property $f$ may refer to $n_2$. To
keep track of the actual information flows between different
program variables, we also keep track of all the program
variables that flow into the nodes in the abstract heap.
Let PVar be the set of all the program variables in the
JavaScript program.

EXPRESSIONS e ::=
    c                           (CONSTANT)
  | x                           (VARIABLE)
  | x.f                         (FIELD ACCESS)
  | x.prot                      (PROTO ACCESS)
  | e op e                      (BINARY OP)
  | this                        (THIS)
  | {f1: e1, ..., fn: en}       (OBJECT LITERAL)
  | function (p1, ..., pn){S}   (FUNCTION DEF)
  | f(a1, ..., an)              (FUNCTION CALL)
  | new f(a1, ..., an)          (NEW)
STATEMENTS S ::=
    skip                        (SKIP)
  | S1; S2                      (SEQ)
  | var x                       (VARIABLE DECL.)
  | x := e                      (ASSIGN 1)
  | e.f := e                    (ASSIGN 2)
  | if e then S1 else S2        (CONDITIONAL)
  | while e do S od             (WHILE)
  | return e                    (RETURN)


Figure 2: Core JavaScript syntax.


More precisely, an abstract heap $\sigma$ is a tuple
$(ns, n, d, fr, dm, tm)$, where:


- $ns$ is a set of heap locations,


- $n \in (ns \cup \{\bot\})$ represents the current node, and is
either a node in the heap or the symbol $\bot$,


- $d \subseteq PVar$ represents the subset of program variables
that flow into the current node $n$,


- $fr \subseteq ns \times PVar \times (ns \cup \{\bot\})$ encodes the
pointers representing properties (fields). A triple
$(n_1, f, n_2) \in fr$ means that the property $f$ of the
object $n_1$ may be located at $n_2$.


- $dm \subseteq ns \times PVar$ is a relation that denotes a dependency
map. A pair $(n_1, x) \in dm$ denotes that the
program variable $x$ flows into the node $n_1$.


- $tm \subseteq ns \times ns$ is a "this-map" relation, which is actually
the relation of a function. A pair $(n_1, n_2) \in tm$
means that the scope of $n_1$ is $n_2$.


Notation: The relation $tm$ will always be a function; we
define formally the function $tm: ns \to ns$ as $tm(n) = n'$,
where $(n, n') \in tm$. Let $dm: ns \to 2^{PVar}$ be the
function that corresponds to the relation $dm$, with
$dm(n) = \{x \mid (n, x) \in dm\}$, i.e., the set of all the
program variables that flow into the node $n$.
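
The tuple and the two derived functions can be made concrete with a small sketch. The encoding below (sets and arrays of pairs and triples, the names sigma, dmOf and tmOf, and the node labels) is an assumption made for illustration and is not taken from the VEX implementation.

```javascript
// A hypothetical encoding of an abstract heap sigma = (ns, n, d, fr, dm, tm).
const BOTTOM = null; // stands for the bottom symbol

const sigma = {
  ns: new Set(["n1", "n2", "n3"]),              // heap locations
  n: "n1",                                       // current node
  d: new Set(["x"]),                             // variables flowing into n
  fr: [["n1", "f", "n2"], ["n2", "g", BOTTOM]],  // (owner, field, target)
  dm: [["n1", "x"], ["n1", "y"], ["n2", "z"]],   // (node, variable)
  tm: [["n2", "n1"], ["n3", "n1"]],              // (node, scope of node)
};

// dm(n): the set of program variables that flow into node n.
function dmOf(heap, node) {
  return new Set(
    heap.dm.filter(([m]) => m === node).map(([, variable]) => variable)
  );
}

// tm(n): the scope of node n, or undefined if the map has no entry.
function tmOf(heap, node) {
  const pair = heap.tm.find(([m]) => m === node);
  return pair ? pair[1] : undefined;
}

console.log([...dmOf(sigma, "n1")], tmOf(sigma, "n2"));
```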


The Analysis: We now describe our analysis for the
core subset of JavaScript. VEX handles functions and
objects by creating a node for every object or function
and their properties. Relationships between various
nodes are accurately generated and tracked in the analysis.
JavaScript uses prototype-based inheritance; however,
our analysis does not track prototypes. Instead, a
new property insertion into the prototype field of an
object is treated as if the property is being inserted into the
object itself. We found that this is sufficient in the case of
JavaScript extensions, as the inheritance chain is not deep
in most cases. VEX keeps track of accurate scope
information using the this-map.

Our analysis consists of a set of rules for generating 
abstract heaps at program points, and is defined by es- 
sentially capturing the effect of statements on the abstract 
heap. These rules follow a big-step operational seman- 
tics adapted to work on the abstracted heap. 

The big-step operational semantics on abstract heaps
is defined as a relation $\langle Prog, \sigma \rangle \Downarrow \sigma'$,
where Prog is a program expression or statement and $\sigma$ and
$\sigma'$ are abstract heaps. Such a relation intuitively means
that $\sigma'$ is the heap obtained from the complete evaluation
of Prog starting from the heap $\sigma$. This resulting heap, in every
iteration, will be merged with the current heap after the
program, conservatively taking the union of dependencies.

We now define this relation for expressions and state- 
ments. 


Notation: For any abstract heap $\sigma$, let
$\sigma = (ns_\sigma, n_\sigma, d_\sigma, fr_\sigma, dm_\sigma, tm_\sigma)$.
In other words, $n_\sigma$ refers to the second component of $\sigma$, etc.
The function fresh() creates a new heap location. A special node $n_G$
represents the global heap, which consists of objects like
Object, Array, etc.


Evaluating expressions: 
Figure 3 gives the rules for evaluating expressions in the 
program. 

Rule (CONSTANT) evaluates to a $\bot$ node with empty
dependencies. Rule (THIS) extracts the scope of the cur- 
rent node. The next five rules describe the variable and 
field access expressions. 

In the case of a variable access, the existence of property $x$
is checked in the current scope (represented by $n_\sigma$; rule
(VAR)), and the property is returned if it exists. If it is not
in the current scope, then the global node (rule (GLOBAL VAR))
is checked for property $x$. If it exists, then it is returned
with dependencies. If the location for a particular variable
is found in neither the current scope nor the global
scope, using rule (UNINITIALIZED VAR) we create a
new node $n_{new}$ and add it to the global scope. Similar
rules apply for field accesses in rules (FIELD ACCESS)
and (UNINIT FLD).
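
The three variable-access cases can be sketched as a lookup over the property relation. This is a simplified illustration rather than the rules of Figure 3: it omits dependency sets, and the heap encoding, the nextId counter that plays the role of fresh(), and the function name lookupVariable are all assumptions.

```javascript
// Look up variable `name`: first in the current scope, then in the global
// node, and otherwise create a new node in the global scope.
function lookupVariable(heap, name) {
  const propertyOf = (node) =>
    heap.fr.find(([owner, field]) => owner === node && field === name);

  const local = propertyOf(heap.n);
  if (local) return { rule: "VAR", node: local[2] };

  const global = propertyOf(heap.nG);
  if (global) return { rule: "GLOBAL VAR", node: global[2] };

  const fresh = "n" + heap.nextId; // plays the role of fresh()
  heap.nextId += 1;
  heap.ns.add(fresh);
  heap.fr.push([heap.nG, name, fresh]);
  return { rule: "UNINITIALIZED VAR", node: fresh };
}

// Hypothetical heap: x lives in the current scope n1, document is global.
const heap = {
  ns: new Set(["nG", "n1", "n2", "n3"]),
  n: "n1",
  nG: "nG",
  nextId: 4,
  fr: [["n1", "x", "n2"], ["nG", "document", "n3"]],
};

const localHit = lookupVariable(heap, "x");
const globalHit = lookupVariable(heap, "document");
const firstMiss = lookupVariable(heap, "y");
const secondLookup = lookupVariable(heap, "y");
console.log(localHit.rule, globalHit.rule, firstMiss.rule, secondLookup.rule);
```

Once the uninitialized variable has been added to the global scope, a later access to the same name resolves through the global case and returns the same node.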

For binary operators (rule (BINARY OP)), we return
the union of dependencies of both the expressions. When
an object literal expression (rule (OBJ. LIT.)) is encountered,
a summary is computed by recursively creating heap locations
for each of its properties and then creating the
graph where the new object node is linked to the properties
with the labeled edges.


Figure 3: Semantics for all core expressions except new.


A function definition (rule (FUN-DEF)) is treated in a similar
fashion to the object literal, except that new summary
locations are created for each of the function arguments
and also for the return variable (i.e., $n^{ret}$). The function
body is evaluated with respect to the new heap. The result
of the evaluation is the new heap with the function
summary attached to the function's node. A function call (rule
(FUN-CALL1)) uses this summary to compute the node
and dependencies of the return value. The return value
of the function can be obtained by evaluating each of the
function argument expressions, and replacing the appropriate
nodes in the function summary with the values returned.
If the function is not defined, then the dependencies
of the return values are the union of dependencies of
the individual function parameters (rule (FUN-CALL2)).
A constructor expression (containing new) is similar to a
function call: if the object being instantiated is
retrieved from the local or the global scope, then a copy of
the graph starting with this object is created and returned.


Evaluating statements: 

The statement semantics are given in Figure 4. A variable
declaration (rule (VAR. DECL.)) creates a new node in
the current scope. If the heap node for that variable already
exists, it is replaced by this new node. The assignment
statement (rules (ASSIGN1) and (ASSIGN2))
evaluates the left-hand side and the right-hand side
expressions, and replaces the node on the left-hand side with
the node on the right-hand side. Note that conditionals in
if-then-else and while statements are, of course,
not evaluated as our heaps are symbolic. The while statement
is interesting: we evaluate the while body till we
reach a fixed point (or till we reach a fixed number of
loop un-rollings) as depicted in (WHILE2). However,
notice that the abstract heap is also allowed to immediately
go across a while-loop (WHILE1). The semantics
for the rest of the statements is standard.


Figure 4: Statement semantics.

Given the above rules for abstract heaps, we start analyzing
the JavaScript program using an initial state consisting
of a global heap, represented by node $n_G$. This
global heap consists of summaries for a few built-in ob- 
jects like Array. We evaluate the rules either till we 
converge on a least fixed-point, or till we reach a preset 
bound on the number of iterations. 
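
The combination of a least fixed-point computation with an iteration bound can be sketched as follows. The sketch works on a plain dependency set rather than a full abstract heap, and the names boundedFixedPoint, step and flowsTo are assumptions made for this example.

```javascript
// Repeatedly apply `step` and merge its result into the current set by
// union. Stop when nothing new is added (fixed point) or when the preset
// bound on the number of iterations is reached.
function boundedFixedPoint(initial, step, maxIterations) {
  let current = new Set(initial);
  for (let i = 0; i < maxIterations; i++) {
    const next = new Set([...current, ...step(current)]);
    // next is a superset of current, so equal sizes mean equal sets.
    if (next.size === current.size) {
      return { deps: current, converged: true };
    }
    current = next;
  }
  return { deps: current, converged: false };
}

// Hypothetical one-step flow relation: a flows to b, b to c, c to d.
const flowsTo = { a: ["b"], b: ["c"], c: ["d"], d: [] };
const step = (deps) => [...deps].flatMap((v) => flowsTo[v]);

const full = boundedFixedPoint(["a"], step, 10);
const cut = boundedFixedPoint(["a"], step, 2);
console.log(full.deps.size, full.converged, cut.deps.size, cut.converged);
// prints: 4 true 3 false
```

With a bound of 2 the flow into d is never discovered, which is one way in which a fixed number of unrollings can lead to missed flows.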


4.2 Handling other features of JavaScript 


Dynamic code: The eval method in JavaScript allows 
execution of dynamically formed code, and is widely 
used in browser extensions. While an accurate analysis 
of the structure of dynamically created code is a research 
topic in itself, and quite out of the scope of this paper, 
we cannot simply ignore eval statements. Our approach 
has been to implement a static constant-string analysis 
for strings and subject the strings that are eval-ed to this 
analysis. Our static analysis engine inserts these constant 
strings into the code (as though it was static code), parses 
it, and computes the flows for them. Strings that are not
statically known but subject to eval are essentially ignored,
and this causes our tool to be unsound (see a later
note on unsoundness). In most correct extensions, an 
eval-ed statement is dynamically chosen from a set of 
constant-strings or taken from trusted sources. Note that 
if there is a flow from an untrusted source to an eval, 
VEX will catch this flow, as it is a vulnerable flow pat- 
tern. 


innerHTML: Modifications of the innerHTML of an
HTML page by the extension make the analysis considerably
more complex. For instance, if a function
a() calls function b() that calls function c(), and
c() makes innerHTML modifications, it is hard to summarize
this effect in the summary of c(), as the source
of the flow is not locally available. We handle this by creating
a symbolic representation of the source, computing
summaries of innerHTML using this symbolic source,
and allowing outside methods to instantiate the symbolic
source to a concrete source in whichever context it becomes
available.


Object properties accessed in the form of associative 
arrays: In JavaScript, objects are treated as associative 
arrays. This means that any property of the object can be 
accessed using the array notation. Array indices could 
be constant strings, which are then evaluated to get the 
actual property being accessed; or they could be numbers,
which indicate the property number that is being
accessed; or they could be variables that are
instantiated at run time. VEX treats these cases in a conservative
manner. Whenever a property is created in the
node scope, its dependencies are added to the dependencies
of the node, as shown in the (ASSIGN 2) rule in
Figure 4. If we cannot evaluate the array index for any
reason, it is sufficient to retrieve the dependencies
of the object.


Functions that take arbitrary number of arguments: 
Some functions in JavaScript can have variable numbers 
of arguments. For example, the push method of the ar- 
ray can be called with any number of arguments and the 
arguments will be appended to the end of the array. To 
handle this, the summary of the push method has a special
field indicating that it can take a variable number of
arguments and when the method is called, we conser- 
vatively append the dependencies of all the arguments 
to the dependency set of the node representing the array 
object. 
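
A minimal sketch of this conservative treatment is given below. The function name applyVariadicCall and the sample dependency sets are assumptions; the sketch only shows the union of argument dependencies into the receiver's dependency set.

```javascript
// Conservatively model a call such as array.push(arg1, ..., argN):
// every argument's dependencies are added to the receiver's set.
function applyVariadicCall(receiverDeps, argumentDeps) {
  const merged = new Set(receiverDeps);
  for (const deps of argumentDeps) {
    for (const variable of deps) merged.add(variable);
  }
  return merged;
}

// Hypothetical dependency sets for the array and two pushed arguments.
const arrayDeps = new Set(["items"]);
const afterPush = applyVariadicCall(arrayDeps, [
  new Set(["title"]),
  new Set(["url", "title"]),
]);

console.log([...afterPush].sort());
// prints: [ 'items', 'title', 'url' ]
```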


Browser’s DOM API and XPCOM components: 
These objects are treated as uninitialized variables, 
fields and functions. The rules (UNINITIALIZED VAR), 
(UNINIT FLD) and (FUN-CALL2) can be applied to 
their accesses. When we need to keep track of the usage 
of certain components we introduce the component 
API function arguments into the dependency set. For 
example the RDF datasource is accessed using the 
following command: 


rdf = Components.classes
    ["@mozilla.org/rdf/rdf-service;1"]
    .getService(Components.interfaces.nsIRDFService);


Our analysis introduces the string
"@mozilla.org/rdf/rdf-service;1" and the variable
nsIRDFService into the dependency set of the left-hand
side variable rdf.


4.3 Unsoundness and incompleteness 


A static analysis tool like VEX is inherently conservative. 
First, if VEX reports a flow, there may be no such feasible 
flow in the program (i.e. VEX can have false positives). 
Though VEX over-approximates flows and tries to per- 
form a sound analysis, there are several aspects of the 
analysis which, if implemented soundly, would make the
tool report too many infeasible flows, making it useless
in practice.

Consider a program where there is an eval of a string 
that is dynamically created and not determinable stati- 
cally. Since this string can be assigned any value, it could 
be any arbitrary program that can create flows between 
any of the variables in scope. A sound tool must necessarily
summarize the eval as causing flows from all
variables to all nodes, which would generate plenty of
false positives and would essentially be useless. False
negatives (i.e., failing to detect programs that have a flow)
are also possible because we have several
uninitialized and unsummarized objects.

VEX has several sources of unsoundness and incom- 
pleteness: handling of eval, handling of prototypes, 
handling of higher-order functions, fixed number of un- 
rolls of loops, handling with-scoping, handling excep- 
tions, etc. 


5 Evaluation 


5.1 VEX implementation 


The VEX tool checks for two kinds of flows: one from 
injectable sources to executable sinks to check for script- 
injection vulnerabilities, and the other, also modeled as 
flows, that checks for unsafe programming practices. 
VEX is implemented in Java (~ 2000 LOC), and uti- 
lizes a JavaScript parser built using the ANTLR parser 
generator for the JavaScript 1.5 grammar provided by 
ANTLR [1]. ANTLR outputs Java-based Abstract Syn- 
tax Trees (AST) for JavaScript files, and VEX walks 
through the ASTs computing the flow sets from all in- 
teresting sources to all interesting sinks, in a single pass 
analysis, using the static analysis described in Section 4. 
For each sink object, VEX collects all the source objects 
that flow into it and checks for the occurrence of flow 
patterns. VEX reports these flows to the user along with 
the source and sink locations in the code. 


Flow patterns checked: The current version of VEX 
checks for the following three flow patterns that capture 
flows from injectable sources to executable sinks: 


- Content Doc to Eval: The source location is any point 
where the program accesses the API 
window.content.document, and the source 
object is the object that is returned from this call. 
The sink locations are eval statements and the sink 
objects are the objects being eval-ed. 


- Content Doc to innerHTML: The source location 
and source objects for these flows are the same 
as above; the sink locations are the places where 
the extension writes directly into the DOM us- 
ing innerHTML commands, and the sink objects 
are the objects being assigned by the innerHTML 
command. These DOM elements may be exe- 
cutable if they are in the chrome context. 


- RDF to innerHTML: The source location and source
objects are given by any retrieval of RDF objects
(which are often injectable) and the sink locations
and sink objects are innerHTML commands as
above.


Furthermore, VEX searches for the following patterns 
that characterize two documented unsafe programming 
practices that could lead to security vulnerabilities: 


- evalInSandbox object to == or !=: This flow is 
meant to detect an unsafe programming prac- 
tice where an object retrieved by an eval in 
a sandbox is subject to an == or != test (the 
recommended practice is that such objects must 
be tested with ===). The source location is hence 
any evalInSandbox-statement and the corre- 
sponding source objects are the objects returned by 
the evalInSandbox call. The sink locations are
usages of == and !=, and the sink objects are the 
objects that are subject to these comparisons. 


- Method Call on wrappedJSObject: Objects ob- 
tained using wrappedJSObject() commands are 
usually untrusted, and methods of such objects 
should not be called. The source locations are hence 
uses of wrappedJSObject() and source objects are 
the objects returned by them. Sink locations are 
method calls and the sink objects are the objects
whose methods are called. 


The VEX tool can, of course, be adapted to other kinds 
of suspect flows — source and sink locations are straight- 
forward, and the source and sink objects must be speci- 
fied carefully as above. 
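
One way to picture such a specification is as a table of source and sink pairs that is matched against the sources computed for each sink object. The record format, the source labels, the file locations and the function name findAlerts below are all assumptions made for illustration and do not reflect the VEX implementation.

```javascript
// Hypothetical pattern table: each pattern names a source and a sink.
const flowPatterns = [
  { name: "Content Doc to eval", source: "window.content.document", sink: "eval" },
  { name: "Content Doc to innerHTML", source: "window.content.document", sink: "innerHTML" },
  { name: "RDF to innerHTML", source: "rdf", sink: "innerHTML" },
];

// Report an alert whenever a pattern's source appears among the sources
// that the analysis computed for a sink of the pattern's kind.
function findAlerts(sinkRecords, patterns) {
  const alerts = [];
  for (const record of sinkRecords) {
    for (const pattern of patterns) {
      if (record.sink === pattern.sink && record.sources.has(pattern.source)) {
        alerts.push({ pattern: pattern.name, location: record.location });
      }
    }
  }
  return alerts;
}

// Hypothetical analysis output: one record per sink object.
const sinkRecords = [
  { sink: "eval", location: "toolbar.js:2", sources: new Set(["window.content.document"]) },
  { sink: "innerHTML", location: "ui.js:28", sources: new Set(["rdf"]) },
  { sink: "eval", location: "util.js:10", sources: new Set(["constant"]) },
];

const alerts = findAlerts(sinkRecords, flowPatterns);
console.log(alerts.map((a) => a.pattern + " at " + a.location));
```

Adding a new suspect flow then amounts to adding one entry to the pattern table, provided the analysis already records the relevant source.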


5.2 Evaluation methodology 


The extensions we analyzed were chosen as follows. 
First, in October 2008, we built a suite of extensions 
using a random sample of 1827 extensions from the 
Mozilla add-ons web site, by downloading the first exten- 
sions in alphabetical order for all subject categories. In 
November 2009, we downloaded 699 of the most popular 
extensions. The two sets had 74 extensions in common, 
for a total of 2452 extensions. Our suite includes multi- 
ple versions of some extensions, allowing cross-version 
comparisons. For instance, we found a vulnerability in a 
new version of BEATNIK (see Section 5.4), though its au- 
thors thought the vulnerabilities in the previous version 
were fixed. 
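
The total follows from the two sample sizes and their overlap; the short check below restates the arithmetic using only the numbers given above.

```javascript
// Corpus size: random sample plus popular sample, minus the overlap.
const randomSample = 1827;
const popularSample = 699;
const inBoth = 74;
const totalExtensions = randomSample + popularSample - inBoth;

console.log(totalExtensions);
// prints: 2452
```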

We extracted the JavaScript files from these extensions 
and ran VEX on them, using a 2.4GHz 64 bit x86 proces- 
sor with a maximum heap size of 4GB for the JVM. 


5.3 Experimental results


Finding flows from injectable sources to executable
sinks: Figure 5 summarizes the experimental results
for flows that are from injectable sources to executable
sinks (the first three flows outlined above). The second
column is the number of extensions that syntactically
have code that could indicate such a flow, identified
using a grep-search. For the flow "Content Doc to
eval", the grep was for the string 'eval('; for "Content
Doc to innerHTML" flows, the grep was for the string
'innerHTML'; and for "RDF to innerHTML" flows,
the search was for both the strings 'innerHTML' and
'@mozilla.org/rdf/rdf-service;1'. As the table shows,
this search finds hundreds of suspect extensions, far more
than can be examined manually.
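
The syntactic search can be sketched as a substring test per flow pattern. The search strings are the ones listed above; the function name grepCandidates and the sample source text are assumptions. The sample contains no untrusted source at all, yet it still matches two patterns, which illustrates why a grep-based count overestimates the number of suspect extensions.

```javascript
// Search strings per flow pattern, as described in the text.
const grepNeedles = {
  "Content Doc to eval": ["eval("],
  "Content Doc to innerHTML": ["innerHTML"],
  "RDF to innerHTML": ["innerHTML", "@mozilla.org/rdf/rdf-service;1"],
};

// A pattern is a candidate when every one of its strings occurs.
function grepCandidates(sourceText) {
  return Object.keys(grepNeedles).filter((pattern) =>
    grepNeedles[pattern].every((needle) => sourceText.includes(needle))
  );
}

// Hypothetical benign extension code: constant strings only.
const benignSource = 'var n = eval("1 + 2"); box.innerHTML = "<b>hi</b>";';
const hits = grepCandidates(benignSource);

console.log(hits);
// prints: [ 'Content Doc to eval', 'Content Doc to innerHTML' ]
```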

The third column indicates the number of extensions 
on which VEX reports an alert with corresponding flows. 
On average, VEX took only 15.5 seconds per extension.

To look for potential attacks, we manually analyzed 
most of the extensions with suspect flows that VEX 
alerted us on, spending about two hours per extension 
on average. 

The next column reports the number of extensions on 
which we could engineer an attack based on the flows 
reported by VEX. We were able to attack six extensions,
of which only three were already known to
be vulnerable. The attacks on Wikipedia Toolbar, Fizzle
version 0.5.1 and Fizzle version 0.5.2 are new;
see Section 5.4 for more details.

The next column shows the extensions where the 
source is code from a web site, and where an attack is
possible provided the web site can be attacked. In other 
words, these extensions rely on a trusted web site as- 
sumption (e.g., that the code on the Facebook website 
is safe). We think that these are valid warnings that users 
of an extension (and Mozilla) should be aware of; trusted 
web sites can after all be compromised, and the code on 
these sites can be changed leading to an attack on all 
users of such an extension. 

Not all flows lead to attacks; the next set of columns
describes the alerts that we were unable to convert to concrete
attacks. Some flows were not exploitable as the
input is sanitized correctly (either by the extension or the
browser), preventing JavaScript injection, while others
were not exploitable as the sinks do not turn out to be
chrome executable contexts. These extensions are noted
in the next two columns. Finally, VEX, being a conservative
flow-analysis tool, does report alerts about flows
that do not actually exist; there were very few of these,
and they are noted under the column "Non-existent flows". A
discussion of flows that do not lead to attacks is given in
Section 5.5.

As noted in the last column, there were 13 extensions
with VEX alerts that were too complex (or obscurely written)
for us to manually analyze for an attack; we do not
know whether attacks on these are possible or not.


Flow Pattern             | grep Alerts | VEX Alerts | Attackable Extensions | Source is trusted website | Not attackable: sanitized input | Not attackable: non-chrome sinks | Not attackable: non-existent flows | Unanalyzed
Content Doc to eval      | ?           | ?          | 2*                    | 1                         | 0                               | 3                                | 3                                  | ?
Content Doc to innerHTML | 534         | 46         | 0                     | ?                         | 6                               | ?                                | ?                                  | ?
RDF to innerHTML         | ?           | 4          | 4**                   | 0                         | 0                               | 0                                | 0                                  | 0


Attackable Extensions: * WIKIPEDIA TOOLBAR V-0.5.7, WIKIPEDIA TOOLBAR V-0.5.9;
** FIZZLE V-0.5, FIZZLE V-0.5.1, FIZZLE V-0.5.2 & BEATNIK V-1.2


Figure 5: Flows from injectable sources to executable sinks.





Unsafe Programming Practice        | grep Alerts | VEX Alerts
evalInSandbox object to == or !=   | ?           | ?
Method Call on wrappedJSObject     | 269         | 144


Figure 6: Results for unsafe programming practices.


Finding unsafe programming practices: 

The results of the second set of experiments for flows 
that characterize the two unsafe programming practices 
of checking equality on objects evaluated in a sandbox 
and calling methods of unwrapped JS objects are shown 
in Figure 6. 

The first column denotes the flow-pattern, the second 
column shows the number of extensions that had a grep 
pattern for the strings ‘evalInSandbox’ and ‘wrapped- 
JSObject’, respectively. The third column shows the 
number of extensions that VEX alerts. Note that these 
flows correspond to unsafe programming practices de- 
clared by Mozilla for extension writers, and hence should 
be avoided. We analyzed 15 of the alerts and found that 
all of the flows we inspected were feasible and real, but 
we were unable to manually confirm the remainder be- 
cause there were too many alerts to examine. 


5.4 Successful attacks 


Attack scripts: All our attack scenarios involve a user
who has installed a vulnerable extension and visits a malicious
page, and who either automatically or by invoking
the extension triggers script written on the malicious
page to execute in the chrome context. Figure 7 illustrates
an attack payload that can be used in such attacks:
this script displays the folders and files in the root di- 
rectory. The attack payloads could be much more dan- 
gerous, where the attacker could gain complete control 
of the affected computer using XPCOM API functions. 
More examples of such payloads are enumerated in the 
white-paper given in [13]. 

Let us now explain the various attacks we found on 
web extensions: 


<script>
var root = Components.classes
    ["@mozilla.org/file/local;1"].createInstance
    (Components.interfaces.nsILocalFile);
try {
    root.initWithPath("/"); // for Linux or Mac
} catch (er) {
    root.initWithPath("\\\\."); // for Windows
}
var drivesEnum = root.directoryEntries,
    drives = [];
while (drivesEnum.hasMoreElements()) {
    drives.push(drivesEnum.getNext().
        QueryInterface(Components.interfaces.
        nsILocalFile).path);
}
alert(drives);
</script>


Figure 7: Attack script to display directories.


Wikipedia Toolbar, up to version 0.5.9
If a user visits a web page with the directory display
attack script in its <head> tag, and clicks on one of
the Wikipedia toolbar buttons (unwatch, purge, etc.), the
script executes in the chrome context. The attack works
because the extension has the code given in Figure 8 in
its toolbar.js file.


script = window._content.document.
    getElementsByTagName("script")[0].innerHTML;
eval(script);





Figure 8: Wikipedia toolbar code. 


The first line gets the first <script> element from the
web page, and the second executes it using eval. The extension
developer assumes the user only clicks the buttons when
a Wikipedia page is open, in which case <script> may
not be malicious. But the user might be fooled by a malicious
Wikipedia spoof page, or accidentally press the
button on some other page. VEX led us to this previously
unknown attack, which we reported to the developers,
who acknowledged it, patched it, and released a
new version. This resulted in a new CVE vulnerability
(CVE-2009-4127). The fix involved inserting a conditional
in the program to check if the URL of the page is
on the Wikipedia domain and evaluating the script only if
this is true.
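
The paper does not show the patch itself. As a rough sketch of what such a conditional could look like, the function below accepts a page only if its hostname is wikipedia.org or a subdomain of it. The function name isWikipediaPage and the test URLs are assumptions; the sketch uses the standard URL constructor, so that a look-alike host or a query string mentioning Wikipedia does not pass.

```javascript
// Return true only for pages whose hostname is wikipedia.org or one of
// its subdomains. Substring checks on the whole URL would be too weak.
function isWikipediaPage(pageUrl) {
  let hostname;
  try {
    hostname = new URL(pageUrl).hostname;
  } catch (err) {
    return false; // not a parseable absolute URL
  }
  return hostname === "wikipedia.org" || hostname.endsWith(".wikipedia.org");
}

const genuine = isWikipediaPage("https://en.wikipedia.org/wiki/Firefox");
const lookAlike = isWikipediaPage("https://wikipedia.org.evil.example/");
const inQuery = isWikipediaPage("https://evil.example/?u=en.wikipedia.org");
const malformed = isWikipediaPage("not a url");

console.log(genuine, lookAlike, inQuery, malformed);
// prints: true false false false
```

Such a check narrows the attack surface but still relies on the trusted web site assumption discussed in Section 5.3, since the evaluated script continues to come from the page.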


BOOKMARKS.JS:
1.  function Bookmarks(){
2.    var bookmarks=new Array();
3.    this.load = function(){
4.      bookmarks=new Array();
5.      var rdf = Components.classes[
          "@mozilla.org/rdf/rdf-service;1"]
          .getService(Components.interfaces.nsIRDFService);
6.      var bmds = rdf.GetDataSource("rdf:bookmarks");
7.      var iter = bmds.GetAllResources();
8.      while (iter.hasMoreElements()){
9.        var element = iter.getNext();
10.       bookmarks.push(
            {name:element.name, url:element.url });
11. }}}


SYS.JS:
12. var sys=new Sys();
13. function Sys() {
14.   var bookmarks = null;
15.   this.startup = function() {
16.     bookmarks = new Bookmarks();
17.     bookmarks.load();
18.     ui.buildFeedList(); }
19.   this.getBookmarks = function(){
20.     return bookmarks; } }


UI.JS:
21. var ui=new Ui();
22. function Ui() {
23.   this.buildFeedList = function() {
24.     var bm=sys.getBookmarks();
25.     for (var i=0;i<bm.size(); i++) {
26.       var mark = bm.get(i);
27.       html += "<p>" + mark.name; }
28.     div.innerHTML = html; } }





Figure 9: FIZZLE vulnerability code. 


Fizzle versions 0.5, 0.5.1, 0.5.2 

FIZZLE is an RSS/Atom feed reader that uses Livemark
bookmark feeds. Vulnerability report CVE-2007-1678
explains that FIZZLE version 0.5 allows remote attackers
to inject arbitrary web scripts or HTML via RSS feeds.
FIZZLE’s RSS feeds are obtained from the bookmarks’
RDF resource, using the XPCOM RDF service. The author
of FIZZLE purportedly fixed this vulnerability in the
next version; however, VEX signaled the presence of a
flow, and we found that the sanitization routine that the
programmer wrote was flawed, and that the extension can
still be attacked using suitably encoded scripts. To the
best of our knowledge, these attacks on FIZZLE versions
0.5.1 and 0.5.2 were not previously known.

Figure 9 gives a highly simplified version of FIZZLE
to show its information flows. When the user clicks
on the FIZZLE extension toolbar to see the feeds,
FIZZLE is initialized, i.e., sys.startup() on line
15 is called. This method loads the bookmarks from
the Firefox bookmarks folder. The title and URL of
the feeds are obtained from the bookmarks’ RDF resource
and then stored in an array in FIZZLE when
bookmarks.load() is called. After the bookmarks
are loaded, ui.buildFeedList() is called. In this
method, the bookmark array is accessed on line 24 and
the elements are added to a variable named html on
line 27. This html variable is then assigned to the
innerHTML property of the <div> tag of an HTML page.
This page is then displayed in a frame in the browser. 
The attack happens when a malicious RDF file is loaded, 
where the name element of the feed contains JavaScript. 
Assigning a specially crafted script to the innerHTML 
property at line 28 results in the script being executed 
under chrome privileges. 
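
The flow can be reproduced outside the browser with a small sketch. The function buildFeedHtml mirrors lines 25 to 27 of Figure 9 on a plain array; escapeHtml and buildFeedHtmlEscaped are hypothetical additions for comparison and are not FIZZLE's code.

```javascript
// Simplified stand-in for Ui.buildFeedList: concatenates bookmark names
// into markup without any encoding, as in Figure 9.
function buildFeedHtml(bookmarks) {
  let html = "";
  for (const mark of bookmarks) {
    html += "<p>" + mark.name;
  }
  return html;
}

// Hypothetical encoder: replaces the characters that let text be
// interpreted as markup. "&" is handled first to avoid double encoding.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

function buildFeedHtmlEscaped(bookmarks) {
  let html = "";
  for (const mark of bookmarks) {
    html += "<p>" + escapeHtml(mark.name);
  }
  return html;
}

// A feed whose name carries markup, as a malicious RDF file could supply.
const feeds = [
  { name: '<img src="" onerror="alert(1)">', url: "http://example.org/feed" },
];
const unsafeHtml = buildFeedHtml(feeds);       // payload survives as markup
const saferHtml = buildFeedHtmlEscaped(feeds); // payload becomes inert text
```

In the unescaped variant the name reaches the string assigned to innerHTML unchanged, which is the source-to-sink flow described above.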

To detect this kind of attack, we must be able to deter- 
mine that the information that flows into the html vari- 
able and eventually into the innerHTML property is from 
the bookmarks’ RDF resource. It is difficult to detect this 
manually, because most extensions are encoded in many 
separate JavaScript files spread across multiple directo- 
ries, and the routines defined in these files have complex 
interactions with each other. Even the example shown 
in Figure 9 is spread over three different JavaScript files, 
and we have omitted many lines of code from the func- 
tions shown. As mentioned earlier, VEX users can define 
summaries for library functions, or just rely on default 
summaries. Given a function summary for the push 
method of the Array object defined in the XPCOM li- 
brary, VEX detects that FIZZLE has flows from the RDF 
service to innerHTML. 
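
The role of a function summary can be illustrated with a toy dependency tracker. This is a sketch under our own assumptions and does not reproduce VEX's internal representation; the names deps, sourcesOf, addSources and summaries are hypothetical.

```javascript
// Abstract store: maps a variable name to the set of sources whose data
// may have reached it.
const deps = new Map();

function sourcesOf(name) {
  return deps.get(name) || new Set();
}

function addSources(name, sources) {
  const merged = new Set(sourcesOf(name));
  for (const s of sources) merged.add(s);
  deps.set(name, merged);
}

// Summaries for library calls: given the receiver and argument names,
// record how dependencies move without analyzing the library body.
const summaries = {
  // Array push: the receiver now depends on every argument.
  push(receiver, args) {
    for (const arg of args) addSources(receiver, sourcesOf(arg));
  },
};

// Abstract trace of Figure 9.
addSources("element", ["RDF"]);              // line 9: value from the RDF service
summaries.push("bookmarks", ["element"]);    // line 10: bookmarks.push(...)
addSources("html", sourcesOf("bookmarks"));  // line 27: html += ... mark.name
const flowsToInnerHTML = sourcesOf("html").has("RDF"); // line 28: sink
```

Without the summary for push, the dependency on the RDF source would be lost at line 10 and the flow to innerHTML would go unreported.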


Beatnik version 1.2 

BEATNIK is another RSS reader with the same kind of
problematic flow as FIZZLE, documented in CVE-2007-3110
for BEATNIK version 1.0. On the Mozilla add-ons
page for the subsequent version of BEATNIK, the extension
developer said he had sanitized the RSS feed input.
VEX found that there were still flows from the bookmarks’
RDF to the innerHTML property in BEATNIK
version 1.2, because VEX currently does not consider
declassification via sanitization. Our manual examination
showed the new sanitization to be inadequate. The
sanitization parses the feed input and checks whether the
nodes contain script. If the feed contains only text nodes,
it is appended to the RSS feed title; otherwise it is discarded.
By encoding the < and > characters as their HTML
entity names, we can fool this routine. If we name the
RSS feed as follows:


Title &lt;/a&gt; &lt;img src=&quot;&quot;
onerror='CODE FROM FIGURE 7' &gt; Beatnik &lt;/img&gt; &lt;a&gt;


the string is converted into

Title </a> <img src="" onerror='CODE
FROM FIGURE 7' > Beatnik </img> <a>


and results in an attack. To the best of our knowledge, 
this attack has not been reported thus far. One must un- 
derstand the extension code to form these attack strings; 
in this case, the <a> tag had to be closed at the begin- 
ning of the string and opened again at the end for the 
script to work. 
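
The general weakness, checking input before it is decoded, can be sketched as follows. The check looksLikeText is a deliberately simplified stand-in and not BEATNIK's routine, decodeEntities handles only the entities used here, and PAYLOAD is a placeholder for the script.

```javascript
// Hypothetical stand-in for a sanitizer that accepts input as plain text
// whenever it contains no literal tag characters.
function looksLikeText(input) {
  return !/[<>]/.test(input);
}

// Minimal entity decoder covering the entities used in the attack string.
// "&amp;" is decoded last so that it cannot create new entities.
function decodeEntities(input) {
  return input
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&");
}

const feedTitle =
  "Title &lt;/a&gt; &lt;img src=&quot;&quot; onerror='PAYLOAD' &gt; " +
  "Beatnik &lt;/img&gt; &lt;a&gt;";

const accepted = looksLikeText(feedTitle);  // the check sees only the encoded form
const rendered = decodeEntities(feedTitle); // markup reappears after decoding
```

The encoded title passes the check, yet the decoded string contains an img element with an onerror handler. Running the check on the decoded form would reject it.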


5.5 Flows that do not result in attacks 


Figure 10 gives several examples of the suspect flows 
that we manually analyzed and for which either trusted 
sources were assumed by the extension or we could not 
find attacks. 

The first set has extensions reading values from web- 
sites or sources it trusts, and the values flow to eval, 
innerHTML, or evalInSandbox. Of course, if the 
trusted sources are compromised, then the extensions 
may become vulnerable. 

The second set illustrates examples where the input 
was sanitized between the source and the sink (we do 
not know for sure that the sanitization is adequate, but 
we were unable to attack it). The third set of extensions 
had non-chrome sinks. The last two examples show false 
positives where the flows reported by VEX do not exist. 
These false alarms are because of the way VEX handles 
variable dependencies imprecisely. For example, the last 
alarm is caused by the rule ASSIGN2 in Figures 3 and 4, 
which conservatively adds the dependencies of variable 
x to field f. 


6 Related work 


Maffeis et al. [27] proposed a small-step operational
semantics for JavaScript, which they use to analyze
security properties of web applications. This operational
semantics is useful both for generating safe subsets of
JavaScript and for manually proving that so-called safe
subsets of JavaScript are in fact vulnerable to certain
attacks [28]. Our operational semantics is inspired by
their approach, although we take the alternate approach
of abstracting the primitive values in the program. This
helps us propose a precise information flow analysis
for non-trivial JavaScript programs. More
recently, Guha et al. [18] also provide an operational
semantics for JavaScript (albeit without semantics for
eval) with the goal of making it easier to prove properties
about JavaScript programs.

Recent work by Ter Louw et al. [25] highlights some 
of the potential security risks posed by browser exten- 
sions, and proposes run time support for restricting the 
interactions between browsers and extensions. Our tech- 
niques are complementary to these techniques since, as 
our experiments show, even restricted interfaces can still 
be susceptible to security vulnerabilities. 

Most recent work on the security of browser exten- 
sibility mechanisms focuses on plugin security. Plug- 
ins are external applications hosted within the browser 
that are used to render non-HTML content, such as Flash 
videos. The first work to examine security issues for 
browser plugins was Janus [14], which discusses sand- 
boxing techniques for browser-helper applications, such 
as PDF viewers. More recently, the OP [15] and Gazelle 
[16] web browsers tackle this same issue by applying 
many of the principles from the original Janus work to 
modern browser plugins. 

The general idea of secure extensibility has been stud- 
ied by the systems community with projects that focus 
on providing secure extensions for operating systems 
via type safe programming languages [5, 31, 36], proof- 
carrying code [29], new OS abstractions [10], and soft- 
ware fault isolation [11]. To date, none of these tech- 
niques have been adapted to address the special security 
needs of web browser extensibility mechanisms. 

Static information flow analysis has been used in a 
number of previous projects. The work proposed in [2] 
tracks whether various variables in the program are
independent from each other through both explicit and
implicit flows. Researchers have employed static analysis
for web applications with the goal of identifying and 
preventing cross-site scripting attacks [26]. For exam- 
ple, Pixy [21] is a taint based static analyzer for PHP 
that detects flows; WebSSARI [19] offers similar facili- 
ties. Vogt et al. [32] propose combining static and dy- 
namic techniques to prevent cross-site scripting. Xie 
and Aiken propose a static analysis of PHP for SQL in- 
jection vulnerabilities [34]. Livshits and Lam develop 
flow-insensitive static analysis tools for security proper- 
ties [24]. 

More recently, researchers have developed
flow-insensitive static information flow methods for
JavaScript [7, 17]. In contrast, VEX’s analysis is
flow-sensitive and context-sensitive. In [7] the authors
essentially perform a flow-insensitive static analysis
on the code, and delegate analysis of dynamic code to
runtime checks. Furthermore, their analysis is context- 
Classification | Extension | Flow pattern/Unsafe practice | Explanation

Source is trusted | TWITZER_-_TWITTER_MORE! v.1.3 | Content Doc to innerHTML | Works only when on Twitter
Source is trusted | ANSWERS v-2.3.50 | Content Doc to innerHTML | Works only on answers.com
Source is trusted | MYSPACE_FRIEND_RENAMER v-.86 | Content Doc to innerHTML | Fetches friend names from prefs.js, where they are stored during instantiation
Sanitized input | GIRL_IN-WONDERLAND v-0.808 | Content Doc to innerHTML | Assigns a Flash URL to innerHTML of an element on the page, and sanitizes the URL before assignment; is the sanitization complete?
Sanitized input | AUTOSLIDESHOW v-0.3.4 | Content Doc to innerHTML | Has a flow from the image name URLs to innerHTML. The extension did not sanitize the inputs in any way. However, the Firefox DOM API methods encoded the URLs when they were being handled by the extension.
Non-chrome sinks | UNHIDE_FIELDS v-0.2e | Content Doc to innerHTML | Creates a frame on top of the current content document and displays the hidden fields in a page in that frame
Non-chrome sinks | WEB_DEVELOPER v-1.1.6 | Content Doc to innerHTML | Generates a non-chrome document in a new tab or window and appends the stylesheet information of a page as a node in this page
Non-existent flows | POWER_TWITTER v-1.37 | Content Doc to eval | Has document, content and window dependencies, but they are chrome elements, not content
Non-existent flows | | | Caused by the ASSIGN2 rule


Figure 10: Example extensions.


insensitive, which could generate a lot of false-positives 
if used for analyzing browser extensions. VEX does 
not delegate any work to runtime checks. Guarnieri
et al. [17] propose a mostly static enforcement for
JavaScript analysis. Their threat model is that of a
malicious JavaScript widget that could run in the same
page as a hosting site and which may contain code
obfuscation. Their policies are based on searching for
forbidden objects or methods in the code, which requires
an accurate pointer analysis that they define.


Several dynamic analysis techniques with static instru- 
mentation have been proposed for JavaScript to check 
information-flow properties [35, 22]. SABRE [9] is a 
framework for dynamically tracking in-browser informa- 
tion flows for analyzing JavaScript-based browser exten- 
sions. We believe that dynamic techniques are not the 
best choice for vetting web extensions, as we think it is 
best to analyze extensions statically before they are un- 
leashed on ordinary users. However, dynamic techniques 
that prevent certain script injection attacks can be useful 
when enforced by the web browser. The drawback is 
that the web browser must choose an appropriate action 
to take when it detects a questionable flow. Querying 
the user may not be wise, and default options may be- 
come too restrictive. Additionally, SABRE imposes a 
performance and memory overhead to the browser be- 
cause of the need to keep track of the security label for 
every JavaScript object inside the browser. 

Recently, Freeman and Liverani from Security Assess- 
ment have written a white paper [12] detailing the pos- 
sible attacks on Firefox extensions. We are currently in 
the process of extending VEX to incorporate some of the 
source/sink pairs shown in that paper.


7 Conclusions 


Our main thesis is that most vulnerabilities in web ex- 
tensions can be characterized as explicit flows, which 
in turn can be statically analyzed. VEX is a proof- 
of-concept tool for detecting potential security vulner- 
abilities in browser extensions using static analysis for 
explicit flows. VEX helps automate the difficult man- 
ual process of analyzing browser extensions by identify- 
ing and reasoning about subtle and potentially malicious 
flows. Experiments on thousands of extensions indicate 
that VEX is successful at identifying flows that indicate 
potential vulnerabilities. Using VEX, we identify three 
previously unknown security vulnerabilities and three 
previously known vulnerabilities, together with a variety 
of instances of unsafe programming practices. 

The most important future direction we envision is to 
extend the VEX analysis in three ways. First, the static 
analysis can benefit from a points-to analysis that is more 
precise on certain aspects of JavaScript such as higher- 
order functions, prototypes, and scoping. The second 
important extension is to define a more complete set of 
flow-patterns (sources and sinks) that capture vulnera- 
bilities. In preliminary work, we have found 16 more 
known vulnerabilities, of which 14 can be characterized 
using information flow-patterns. Statically identifying
these source-sink pairs and adding them to VEX would
yield a more comprehensive tool. Third, to reduce false
positives, automatically building attack vectors
for statically discovered flows can help synthesize
attacks; a key challenge in achieving this would be
handling sanitization routines effectively [3, 30].
Abstract


Web browsers are increasingly designed to be ex- 
tensible to keep up with the Web’s rapid pace of 
change. This extensibility is typically implemented 
using script-based extensions. Script extensions 
have access to sensitive browser APIs and content 
from untrusted web pages. Unfortunately, this pow- 
erful combination creates the threat of privilege es- 
calation attacks that grant web page scripts the full 
privileges of extensions and control over the entire 
browser process. 

This paper makes two contributions. First, it 
describes the pitfalls of script-based extensibility 
based on our study of the Firefox web browser. We 
find that script-based extensions can lead to arbi- 
trary code injection and execution control, the same 
types of vulnerabilities found in unsafe code. Sec- 
ond, we propose a taint-based system to track the 
spread of untrusted data in the browser and to de- 
tect the characteristic signatures of privilege escala- 
tion attacks. We evaluate this approach by using ex- 
ploits from the Firefox bug database and show that 
our system detects the vast majority of attacks with 
almost no false alarms. 


1 Introduction 


Most web browsers today provide powerful exten- 
sibility features, including native and script-based 
extensions. Native extensions (or plugins) are typi- 
cally used when performance is critical (e.g., virtual 
machines for Java, Flash, media players, etc.), while 
script extensions ensure memory safety and have the 
advantage of being inherently cross-platform and 
amenable to rapid development. Examples of pop- 
ular script extensions include the Firefox Adblock 
extension [1] that filters content from blacklisted ad- 
vertising URLs, and Greasemonkey [4] that allows 
users to install arbitrary scripts in web pages for cus- 
tomization or to create client-side mashup pages. 
Script extensions must have access to both sensi- 
tive browser APIs and content from untrusted web 
pages. For example, Adblock must be able to access
the local disk to store its URL blacklist and
access web pages to filter their content. This com- 
bination is needed for writing powerful extensions, 
but it creates challenges for securely executing web 
page scripts. Specifically, when extensions interact 
with web pages, there is a risk of a privilege escala- 
tion attack that grants web page scripts the full privi- 
leges of script extensions and control over the entire 
browser process. Privilege escalation vulnerabilities 
are perhaps even more critical than memory safety 
vulnerabilities because script-based attacks can of- 
ten be executed reliably. 

Our goals in this paper are two-fold: 1) under- 
standing the nature of script-based privilege escala- 
tion vulnerabilities, 2) proposing methods to secure 
the Firefox browser against them. Privilege esca- 
lation vulnerabilities are common in Firefox, and 
comprise roughly a third of the critical vulnerabil- 
ity advisories. They arise from unsafe extension be- 
haviors or bugs in the Firefox security mechanisms 
that regulate interactions between trusted native or 
extension scripts and untrusted web page scripts. 
These vulnerabilities have appeared regularly in ev- 
ery version of the browser and exist even in the lat- 
est versions. This is despite continuing effort from a 
dedicated team of security developers that have pro- 
gressively improved the browser security model. 

The Firefox security model consists of a com- 
bination of stack inspection and one-way names- 
pace isolation. The stack inspection mechanism, 
implemented at the boundary of the script and na- 
tive code, regulates accesses to sensitive native in- 
terfaces based on the principals of the caller. For 
example, a local file access is denied if the current 
stack contains a frame associated with an untrusted 
principal.¹ Namespace isolation is used to enforce
the same-origin policy for web page scripts. This 
policy limits interactions between scripts and doc- 
uments loaded from different origins. The names- 
pace isolation is one way in that script extensions 


¹ A principal represents the code’s origin and, for web page
scripts, it consists of a scheme, host, port combination. 


19th USENIX Security Symposium 355 




are privileged and allowed to access content names- 
paces, but web page scripts should not be able to 
obtain a reference to the privileged namespace. This 
policy is designed to enforce the same-origin policy 
and defend against privilege escalation attacks. 
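
The stack inspection rule can be sketched in a few lines. This is a toy model for illustration only and not Firefox code; the frame layout, the principal strings and the function names are assumptions.

```javascript
// Hypothetical model: each stack frame records the principal of its code.
// "chrome" stands for privileged code; any other value is an untrusted origin.
const SYSTEM = "chrome";

function isTrusted(principal) {
  return principal === SYSTEM;
}

// Stack inspection: a sensitive operation is allowed only if no frame on
// the current stack belongs to an untrusted principal.
function mayAccessLocalFile(stack) {
  return stack.every((frame) => isTrusted(frame.principal));
}

// A call made purely by extension code.
const extensionOnly = [{ fn: "saveBlacklist", principal: SYSTEM }];

// The same call reached while a web page handler is on the stack.
const pageOnStack = [
  { fn: "saveBlacklist", principal: SYSTEM },
  { fn: "onclick", principal: "http://evil.example:80" },
];
```

The decision depends only on the principals recorded in the frames, so it is only as reliable as the assignment of principals to code.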

These security mechanisms are well understood, 
but they have two flaws: 1) relying entirely on prin- 
cipals as a measure of trustworthiness for stack in- 
spection, and 2) depending on one-way namespace 
isolation to work correctly. First, an exploit
can leverage browser bugs or vulnerable extensions 
to confuse the browser into assigning wrong princi- 
pals to code or executing data or code with wrong 
principals, thus defeating stack inspection. Second, 
reference leaks can occur because of interactions 
between privileged and unprivileged scripts, com- 
promising namespace isolation and allowing un- 
privileged scripts to affect the execution of privi- 
leged scripts. As a result, we find that arbitrary code 
injection and execution control vulnerabilities that 
commonly exist in unsafe code can also occur with 
script-based extensibility. 

Based on the flaws described above, our solution 
for securing the Firefox browser consists of com- 
bining tainting with the existing stack-based secu- 
rity model. Our approach guarantees that tainted 
data will not be executed as privileged code. Taint- 
ing all data from untrusted origins and propagating 
the tainted data throughout the browser provides a 
much stronger basis for making security decisions. 
In essence, our attack detectors “second guess” the 
security decisions of the browser by taking into account
one additional piece of information, i.e., the
taint status. This solution is conceptually simple 
and well-suited for the browser’s security model be- 
cause namespace isolation already provides a se- 
curity barrier between the taint sources in content 
namespaces and privileged code residing in exten- 
sion namespaces. As a result, we show that it is un- 
likely that attacks will be detected erroneously, even 
if we fully taint all data and scripts from web pages. 
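
The idea of second-guessing a principal-based decision with taint can be sketched as follows. The request shape and both function names are hypothetical and are used only to illustrate the combination of the two checks.

```javascript
// Hypothetical model of a compilation request. `principal` is what the
// browser believes the code's origin to be; `tainted` records whether the
// source string derives from untrusted web content.
function browserAllows(request) {
  return request.principal === "chrome";
}

// The detector re-checks an allowed decision against the taint status:
// privileged compilation of tainted data is flagged as an attack.
function detectEscalation(request) {
  return browserAllows(request) && request.tainted;
}

const requests = [
  { source: "extension file", principal: "chrome", tainted: false },
  { source: "page string passed to eval", principal: "chrome", tainted: true },
  { source: "page script", principal: "http://site.example:80", tainted: true },
];

// Only the second request is flagged: it is tainted yet would run privileged.
const flagged = requests.filter(detectEscalation).map((r) => r.source);
```

Ordinary page scripts are tainted but unprivileged, so they are not flagged, which is consistent with the expectation of few false alarms.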

The contributions of this paper are two-fold: 1) 
we analyze and classify script-based privilege esca- 
lation vulnerabilities in the commonly used Firefox 
browser, 2) we use taint-based stack inspection to 
design effective signatures for script-based exploits 
and evaluate this approach. We use Firefox version 
1.0 for the evaluation because it has several privilege
escalation vulnerabilities and easily-available
exploits. Our results show that we can detect the 
vast majority of attacks with almost no false alarms 
and modest overhead. 

Below, Section 2 provides background on the 
Firefox security model. Section 3 presents our 
classification of privilege escalation vulnerabilities 
and sample exploits. Section 4 describes our taint- 
based approach for securing script-based extensi- 
bility. Section 5 provides an evaluation of our ap- 
proach. Section 6 describes related work in the 
area and Section 7 presents our conclusions and de- 
scribes future work. 


2 The Firefox Browser 


In this section, we provide an overview of the Fire- 
fox architecture and its security model. 


2.1 Architecture 


Figure 1 shows a simplified version of the Firefox
architecture relevant to this work. The basic
browser functionality is provided by native C++
components written using Mozilla’s cross-platform 
component model (XPCOM). XPCOM components 
implement functionality such as file and socket ac- 
cess, the document object model (DOM) for rep- 
resenting HTML documents, and higher-level ab- 
stractions, such as bookmarks, and expose this func- 
tionality via the XPIDL interface layer. The Script 
Security Manager (SSM) is an XPCOM component 
responsible for implementing the browser’s security 
mechanisms. 

The JavaScript interpreter accesses XPCOM 
functionality via the XPConnect translation layer. 
This layer allows the interpreter and the XPCOM 
classes to work with each other’s data types transparently.
XPConnect also serves as the primary security
barrier for enforcing the browser’s same origin
policy and restricting access to sensitive XPCOM 
interfaces. 

Firefox’s script extensions and privileged UI 
scripts, shown in Figure 1, are loaded from lo- 
cal files through URIs with the “chrome” protocol. 
They are privileged and have access to a greater 
number of XPCOM interface methods than web 
page scripts and are not subject to the browser’s 
same origin policy. Similar to other browsers, Fire- 
fox also supports native plugins for Java, Flash, etc. 
Figure 1: The Firefox architecture.


Although potential security vulnerabilities can ex- 
ist within plugin implementations, we do not ad- 
dress them. However, with appropriate sandboxing 
of plugins [14, 23], we would be able to monitor any 
script interactions with the plugins. 


2.2 Security Model 


Firefox primarily uses two security schemes, 
namespace isolation and a subject-verb-object
model based on stack inspection. Namespace iso- 
lation is used to enforce the same origin policy for 
web page scripts, and stack inspection regulates ac- 
cess to sensitive XPCOM components. We de- 
scribe each in more detail below. 


2.2.1 Namespace Isolation 


The browser runs scripts within an object names- 
pace that defines the objects available to the script. 
A window object lies at the root of the namespace 
for each web page. For example, web page scripts 
manipulate HTML by invoking the DOM methods 
of the document object that is a property of this win- 
dow object. 

The browser enforces the same origin policy by 
running web page scripts from different web pages 
in different namespaces. These scripts are only al- 
lowed to access other namespaces from the same 
origin (described below). Extension scripts are al- 
lowed to access all content namespaces. Extension 
namespaces are hidden from the web page scripts, 
and extensions are not expected to invoke web page 
scripts directly. 


2.2.2 Subject-Verb-Object Model 


Firefox uses a “Subject-Verb-Object” access control
model. The subject is the principal of the currently
executing code, the verb is one of a limited number
of operations (e.g., call a function F, get a property
A, set a property B), and the object is the principal 
of the object that is the target of the operation. This 
security mechanism is implemented in the Script 
Security Manager, and invoked by XPConnect to 
regulate access to sensitive XPCOM interfaces and 
by the interpreter to limit access to sensitive func- 
tions and object properties. 

The principal of a web page script is defined by 
the origin of the document containing the script (its 
protocol, hostname, and port). The Script Security 
Manager determines the subject principal by walk- 
ing down the JavaScript stack until it finds a stack 
frame with a script principal. The object principal is 
determined by walking up the object’s parent chain 
(scope chain) in its namespace until an ancestor ob- 
ject with a principal is found. For web pages, the ob- 
ject’s parent chain leads to a top-level HTML docu- 
ment associated with the window object. 
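
The two lookups can be sketched with plain objects. This is an illustrative model only: the frame and object shapes are assumptions, the verb is omitted, and the access rule in checkAccess is a simplification of the policy described above.

```javascript
// Subject principal: search from the most recent frame toward older frames
// until a frame carrying a script principal is found. Native frames are
// modeled with principal === null.
function subjectPrincipal(stack) {
  for (let i = stack.length - 1; i >= 0; i--) {
    if (stack[i].principal !== null) return stack[i].principal;
  }
  return null;
}

// Object principal: follow parent links until an ancestor has a principal.
function objectPrincipal(obj) {
  for (let o = obj; o; o = o.parent) {
    if (o.principal) return o.principal;
  }
  return null;
}

// Simplified rule (an assumption for this sketch): privileged subjects may
// access anything; other subjects only objects with the same principal.
function checkAccess(stack, obj) {
  const subject = subjectPrincipal(stack);
  if (subject === null) return false;
  return subject === "chrome" || subject === objectPrincipal(obj);
}

// A DOM node whose parent chain ends at a document with an origin.
const doc = { principal: "http://a.example:80", parent: null };
const node = { principal: null, parent: { principal: null, parent: doc } };
```

For example, a stack whose newest scripted frame belongs to http://a.example:80 may access node, while a frame from a different origin may not.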


3 Script-Based Privilege Escalation 


Privilege escalation vulnerabilities are created by 
unsafe extension behaviors or bugs in the Firefox 
security mechanisms that regulate interactions be- 
tween privileged and unprivileged code. In this 
section, we first discuss different classes of script- 
based privilege escalation vulnerabilities and then 
describe examples of real vulnerabilities. 


3.1 Vulnerability Classification


Our analysis of the Firefox bug database revealed 
four main classes of privilege escalation vulnera- 
bilities: code compilation, luring, reference leaks 
and insufficient argument sanitization. Most of the 
known Firefox vulnerabilities can be attributed to 
one or more of these classes. 


3.1.1 Code Compilation Vulnerabilities 


Similar to cross-site scripting (XSS) vulnerabilities
that occur in web sites, code compilation
vulnerabilities allow arbitrary strings from content
namespaces to be compiled into JavaScript bytecode
with privileged principals. Unlike a statically
typed language such as Java, JavaScript allows
arbitrary strings to be converted into bytecode
at runtime through eval and eval-like functions
such as setTimeout. The eval function compiles
a string into bytecode and executes it with
the principal of the calling script, even if the string
was obtained from a different namespace. Code
compilation vulnerabilities occur if attackers can
trick privileged code into compiling strings supplied
by the attacker, or if they can find bugs in
the rules for assigning principals to newly compiled
bytecode. For example, it can be dangerous
for privileged code to load URIs from untrusted
namespaces, because URIs can carry
script code inline: the “javascript”
protocol (e.g., javascript:alert('Hello
World');) executes the text after the protocol
name as a script in the current namespace.

This problem may seem simple, but it has been 
the cause of several security bugs in Firefox. For 
example, even after vulnerable code was patched to 
sanitize URIs before loading them, exploits were 
possible because the sanitization did not account for nested
URIs such as view-source:javascript:.
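
The nested-URI problem can be sketched as follows. Both functions are hypothetical illustrations and not the Firefox patch; naiveIsSafe inspects only the outermost scheme, while nestedIsSafe peels each leading scheme in turn.

```javascript
// Naive check: only looks at the outermost scheme.
function naiveIsSafe(uri) {
  return !uri.trim().toLowerCase().startsWith("javascript:");
}

// Hypothetical stricter check: peel every leading "scheme:" prefix and
// reject the URI if any layer is "javascript".
function nestedIsSafe(uri) {
  let rest = uri.trim().toLowerCase();
  const scheme = /^([a-z][a-z0-9+.-]*):/;
  let match;
  while ((match = scheme.exec(rest)) !== null) {
    if (match[1] === "javascript") return false;
    rest = rest.slice(match[0].length); // continue with the nested part
  }
  return true;
}
```

The naive check accepts view-source:javascript:alert(1), whereas the nested check rejects it. A denylist of this kind remains fragile; accepting only a short list of expected schemes is usually the more robust design.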


3.1.2 Luring Vulnerabilities


Luring vulnerabilities allow malicious scripts to 
trick privileged code into calling a privileged func- 
tion of the attacker’s choosing instead of the in- 
tended callee. Stack inspection prevents unprivi- 
leged scripts from calling the privileged functions 
directly, so malicious scripts must lure privileged 
code into making these calls. Luring is possible be-
cause script extensions routinely access DOM ob- 
jects in content namespaces. These DOM ob- 
jects are simply JavaScript wrappers for native XP- 
COM objects with well-defined, native interfaces. 
However, JavaScript’s flexibility allows web page 
scripts to modify these wrapper objects. In ver- 
sions of Firefox after 1.0.3, privileged code is pro- 
tected by automatically created “safety wrappers” 
that hide any wrapper changes made by untrusted 
code. However, if the safety wrapper code contains 
bugs (as has often been the case), privileged code 
again becomes vulnerable to luring attacks. 

In order to execute privileged code, an attacker 
can choose one of three possible kinds of callees: 1) 
an eval-like native function, 2) a privileged function 
accidentally leaked into the content sandbox (see 
next section), or 3) a privileged native method that 
legitimately exists in content namespaces. The third 
category consists of XPCOM methods that are vis- 
ible to ordinary web page scripts because they are 
meant to be invoked by digitally signed web page
scripts. For example, the preference() method
of the navigator object allows privileged scripts
to read or write the browser’s configuration, such as
the browser’s homepage and security settings. Ordinary
web page scripts cannot invoke the sensitive
preference() method directly, but since every
function is also an object in JavaScript, web page 
scripts can obtain an object reference to this method 
and potentially trick buggy privileged code into in- 
voking the reference. 


3.1.3 Reference Leak Vulnerabilities 


Reference leak vulnerabilities occur when web page 
scripts gain access to references in the extension 
namespace [11]. These leaks are compromises in 
the isolation between privileged and unprivileged 
namespaces. They allow an attacker to modify data 
or code defined in a privileged namespace and call 
arbitrary functions within the privileged namespace, 
potentially leading to arbitrary execution control. 
Reference leaks are dangerous because privileged 
code that depends on namespace isolation may be- 
come accessible to web page scripts or it may be- 
come vulnerable to code compilation or luring at- 
tacks. Reference leaks can occur due to bugs in na- 
tive code that deals with namespaces. Also, careless 
extensions may place references to privileged ob- 
jects in an untrusted namespace. Finally, reference 
leaks can lead to cross-principal confidentiality vio- 
lations, but we do not address confidentiality in this 


paper. 
3.1.4 Insufficient Argument Sanitization 


Vulnerabilities can also occur if a browser extension 
uses unsanitized data from untrusted namespaces as 
arguments to privileged XPCOM APIs. For exam- 
ple, if an extension used to download Flash videos 
from web pages uses the name of the movie file on 
a web page as part of the local filename to which 
the file is saved, it may be open to directory traversal
attacks (e.g., using “../” to access normally inaccessible
directories) that would not be detected by
the browser’s stack inspection mechanism. If the 
overwritten file were an extension JavaScript file, it 
would lead to a privilege escalation attack. This specific
class of vulnerability has not been documented
in the Firefox bug database, but we consider it a
likely vulnerability for extensions.


onLinkIconAvailable: function(Href) {
    if (favIcon && ...) {
        favIcon.setAttribute("src", Href);
    }
}


Figure 2: Target code invoked when a LINK tag is
found in the current web page.


3.2 Examples


We describe some examples of privilege escala- 
tion vulnerabilities from the Firefox bug database 
to show that these vulnerabilities can be subtle and 
easy to overlook. 


3.2.1 URI Code Injection 


Figure 2 shows an example of browser JavaScript 
containing a code compilation vulnerability that can 
lead to URI code injection (Bug 290036). This GUI 
code displays a favicon (16x16 pixel icon) image 
next to the browser’s URL bar. Normally the icon’s 
URI, which is specified by the current web page, 
would be the HTTP address of the favicon image, 
but a malicious web page can specify a “javascript” 
protocol URI. When the privileged UI code attempts 
to load the image by setting the src property of 
the icon container to the Href URI, it will inadver- 
tently execute script code. This code will be com- 
piled with the unprivileged principals of the URI, 
but it will have access to the privileged UI names- 
pace, allowing reference leaks, which can then be 
used for other attacks (e.g., see Section 3.2.4). This 
vulnerability occurs because the native code im- 
plementing the icon container and the compilation 
function are unaware of the origin of the Href ar- 
gument. 


3.2.2 Compilation with Wrong Principals 


Figure 3 shows code that exploits a code 
compilation and a reference leak vulnerabil- 
ity to create a dynamically-defined function 
(clonedFunction) with elevated privileges. 
The eval function compiles and executes the 
evalCode string with the unprivileged principal of 
the web page. However, the attacker has also supplied
a second argument that specifies the namespace
for name resolution during the string evaluation.
Normally, this argument does not cause a
problem because it belongs to the same namespace
as the caller’s namespace. However, xbl_object
is a benign reference leak from a privileged namespace.


evalCode = "clonedFunction = \
    function deliverPayload(){...}; \
    clonedFunction();";

myElem = document.getElementById("myMarquee");
xbl_object = myElem.init.call;
eval(evalCode, xbl_object);


Figure 3: Exploit code that allows untrusted functions
to be associated with privileged principals.

Exposing xbl_object is a reference leak, but it
is not sufficient for an attack because the interpreter
invokes eval with the correct caller’s principals.
However, within eval, once run, the evalCode
byte code gets access to a privileged namespace. 
This access by itself is still not a problem because 
evalCode runs with the web page principals, and 
thus will not be able to get past the stack inspection 
checks. Similarly, invoking deliverPayload 
directly within evalCode would not be problem- 
atic. 

The exploit occurs when evalCode creates 
a function referenced by clonedFunction. 
The interpreter creates a new function object 
in the privileged namespace that is a clone of 
deliverPayload. When a function is created
by cloning, its principal is set to its object princi- 
pal, as described in Section 2.2.2. When the cloned 
function is invoked, it executes its payload with el- 
evated privileges. In effect, this exploit attaches a 
user-supplied function to a privileged namespace, 
making it appear privileged to the security manager. 
This vulnerability occurs because the implementa- 
tion of eval did not check that it was compiling 
code from one principal and executing it within the 
namespace of a more privileged principal. 
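
The missing check can be sketched in a few lines. This is a minimal model and not Firefox code: the Principal class, its numeric level, the subsumes method, and guardedEval are hypothetical names introduced for illustration, and real principals are not totally ordered integers.

```javascript
// Hypothetical model of principals: a higher level means more privilege.
class Principal {
  constructor(name, level) {
    this.name = name;
    this.level = level;
  }
  // In this model, a principal subsumes another if it is at least as privileged.
  subsumes(other) {
    return this.level >= other.level;
  }
}

const SYSTEM = new Principal("system", 2);
const WEB_PAGE = new Principal("http://example.com", 0);

// Refuse to evaluate code inside a namespace that is more privileged than
// the caller. `scope.principal` stands in for the object principal of the
// second argument passed to eval.
function guardedEval(callerPrincipal, code, scope) {
  if (!callerPrincipal.subsumes(scope.principal)) {
    throw new Error("eval blocked: scope is more privileged than caller");
  }
  // Stand-in for compilation; nothing is executed in this sketch.
  return `compiled <${code}> for ${callerPrincipal.name}`;
}
```

Under these assumptions, a web page caller that passes a privileged scope is rejected before any compilation takes place, while a privileged caller may still evaluate code in a less privileged scope.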

The patch for this vulnerability added a check to
eval to ensure that the principal of the caller subsumes
the object principal of the second argument.
However, it was discovered that this patch could
be bypassed by invoking eval indirectly using the
timer method setTimeout. When the natively-implemented
timer fires, there are no JavaScript
frames left on the stack, so the caller’s principal
is the fully privileged principal of the native timer
code. The next patch prevented eval from being
called directly by native code. Further patches were
needed to fix other attacks on eval.


var code = "/* payload */";
document.body.__defineGetter__
    ("localName", Script(code));


Figure 4: Simplified exploit code for Bug 289074.


3.2.3 Luring Privileged Code 


Figure 4 shows the exploit code for a luring
attack. This exploit would trigger if the
document.body.localName property is read
by the UI code. This code tricks the privileged code 
into working with a different property than the one 
it expects by associating a getter function with a na- 
tive DOM object property (localName). Further- 
more, the Script object behaves like an eval-like 
function that allows strings to be precompiled and 
executed with the privileges of the caller’s princi- 
pal. The consequences are equivalent to privileged 
JavaScript executing a string of the attacker’s choos- 
ing, although no code is compiled in the privileged 
namespace. This vulnerability occurs because the 
caller accesses an overridden property. 
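
The effect of an overridden property can be reproduced in plain JavaScript. The sketch below uses the standard Object.defineProperty in place of the legacy __defineGetter__ call from Figure 4 and a plain object in place of document.body; the names pageBody, privilegedUiCode, and log are assumptions made for illustration.

```javascript
// Stand-in for a DOM node that both the page and privileged code can reach.
const pageBody = { localName: "body" };
const log = [];

// Attacker (web page script): replace the data property with a getter.
Object.defineProperty(pageBody, "localName", {
  get() {
    log.push("attacker code ran");
    return "body";
  },
  configurable: true,
});

// Privileged code believes it is reading a plain string property.
function privilegedUiCode(node) {
  return node.localName.toUpperCase();
}

const result = privilegedUiCode(pageBody);
```

The getter runs on the reader's stack, so from the point of view of stack inspection it is the privileged reader, not the page, that executes the attacker's function.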

This problem was so widespread in Firefox 
1.0 that it motivated developers to implement the 
“safety wrapper” mechanism that allows privileged
scripts to work with native DOM objects without 
being exposed to any modifications made by web 
page scripts. However, even the latest releases of 
Firefox continue to suffer from bugs in assigning 
wrappers, thus allowing privileged scripts to interact 
with tampered DOM methods and properties [6]. 


3.2.4 Privileged Reference Leaks 


Figure 5(a) shows code that exploits a reference leak 
vulnerability in the QueryInterface XPCOM object. 
A flaw in the XPConnect code for setting up safety 
wrappers for native objects inadvertently sets a priv- 
ileged object as the prototype of the safety wrapper 
for QueryInterface in untrusted namespaces. Malicious
code can use this leak to reach the global
object of a privileged namespace. The exploit calls
the script method foo.getClassObject in the
privileged namespace with a specially-crafted argument
to carry out a luring attack.


var leaked =
    QueryInterface.__proto__.__parent__;
var cid = {equals: Script(payload)};
leaked.foo.getClassObject(cid);

(a) Simplified exploit code.

var foo = {
    getClassObject: function(aCID) {
        if (aCID.equals(value))
            return this._objects[key];
    }
};

(b) Simplified target code.


Figure 5: Exploit and target code for Bug 294795.


*The Firefox-specific Script object has been deprecated since
Firefox 3.0, presumably due to security risk.

The getClassObject method shown in Figure 5(b)
relies on namespace isolation and thus expects
to be called from other privileged functions
with safe arguments. However, when it calls the 
equals method of its aCID parameter, it inadver- 
tently invokes the Script object defined by the at- 
tacker, executing it with full privileges. 


3.2.5 Loading Privileged URIs 


There are also attacks that use a combination of a 
bug that allows unprivileged pages to load higher 
privilege documents (e.g., “chrome” protocol URIs) 
and a cross-site scripting (XSS) bug to inject their 
own scripts into these pages. Bug 306261 allowed 
untrusted pages to bypass restrictions on loading 
privileged URIs of the “about” protocol (which al- 
lows setting browser configuration values) by using 
a malformed URI. We do not address XSS bugs or 
violations of URI loading policies, but our system 
is able to detect this category of attacks because it 
leads to code injection. 


3.3 Comparison With Memory Safety


JavaScript extensions have many clear benefits, but 
they suffer from risks posed by these four classes 
of vulnerabilities. As a result, Firefox users have 
been victims of real-world privilege escalation at- 
tacks and the Firefox bug database shows that the 
incidence rate for these types of vulnerabilities is 
comparable to memory-safety vulnerabilities (more
on this in Section 5.1). 

At first, this may seem counter-intuitive: com- 
ponents written in a memory-safe, interpreted lan- 
guage should be more secure than their native equiv- 
alents. This intuition may be true in single-principal 
applications, but Firefox must execute JavaScript 
from multiple principals concurrently and must ar- 
bitrate over many possible interactions, which raises 
the specter of bugs leading to privilege escalation at- 
tacks. 

In fact, the classes of vulnerabilities we found 
for the multi-principal Firefox script environment 
are similar to memory-safety vulnerabilities found 
in single-principal native code. The code compila- 
tion vulnerabilities are not unlike buffer overflows: 
data is executed as code, allowing for arbitrary code 
execution. The luring vulnerabilities allow attackers 
to call existing functions of their choosing, similar 
to return-to-libc attacks [5]. 


4 Approach 


Script-based extensibility in the Firefox web 
browser is a powerful feature and is highly valued 
by its users. However, it leads to privilege escalation 
vulnerabilities precisely because of the dynamic and 
flexible nature of the script language used to imple- 
ment the extensions. The language features allow 
leveraging browser bugs or vulnerable extensions to 
confuse the browser into assigning wrong principals 
to code, thus bypassing stack inspection. 

Privilege escalation vulnerabilities also arise be- 
cause Firefox’s implementation of one-way names- 
pace isolation is inherently error prone. The 
browser fully trusts script extensions, but these 
scripts can interact with data from unprivileged 
sources in unsafe ways, compromising namespace 
isolation. One-way namespace isolation will not 
disappear from extensible browser architectures, as 
extensions will always need to read and modify un- 
trusted web pages. One method of improving the 
security of one-way namespace isolation is to pro- 
vide stronger isolation guarantees. For example, 
Google Chrome [10] divides an extension into separate
processes, one for accessing privileged interfaces,
and another for interacting with untrusted
web pages, while only allowing IPC between the 
two processes. This architecture requires increased 
implementation effort from the extension developer
and is completely incompatible with the Firefox ex- 
tension model. 

Instead, our solution is to use tainting to aug- 
ment the browser’s security mechanisms. We use 
tainting because it helps detect when untrusted content
can affect privileged code. Furthermore, it is
fully compatible with the current Firefox extension 
model. Unfortunately, many tainting-based systems 
suffer from endemic false alarms and thus are un- 
usable in practice [18]. In this section, we show 
that our tainting-based solution, while being con- 
ceptually simple, is well-suited for the browser’s se- 
curity model because namespace isolation already 
provides a security barrier between the taint sources 
in content namespaces and privileged code in exten- 
sion namespaces. 


4.1 Threat Model 


We define a privilege escalation attack as tainted 
data executing as privileged code. Tainted data is 
executed as privileged code if it is compiled into
script byte code tagged with the wrong principals, 
or if tainted data is used as a reference to execute 
privileged code. Both scenarios lead to a failure of 
the browser’s security mechanism for guarding ac- 
cess to sensitive interfaces, allowing untrusted web 
pages to gain the ability to modify the host system. 

We add security checks and augment stack in- 
spection to look for the characteristic signature of 
privilege escalation attacks. To do so, we rely on 
the memory safety of the browser as well as the 
browser’s ability to correctly assign a principal to 
a web page when it is first loaded, before any web 
page scripts begin executing. Assigning this prin- 
cipal is straightforward as it only depends on the 
web page’s URI. We do not depend on the correct- 
ness of the rest of the code that assigns principals, or 
code that interprets principals. Instead, we “second 
guess” browser security code by auditing its secu- 
rity decisions with the additional taint status infor- 
mation. 


4.2 Tainting 


We consider all documents fetched from remote 
sources or local documents opened with the “file” 
protocol as untrusted and taint them because the 
browser does not assign them a privileged principal.
When documents are loaded into the browser,
they are parsed into a tree of native DOM objects, 
representing individual markup elements and their 
attributes. All nodes of the tree are individually 
marked tainted, including the text of any scripts de- 
fined inside the document, such as in event handlers 
or in SCRIPT tags, and taints are tracked separately 
for each attribute of a DOM element. 

Our tainting system uses different policies based 
on the privilege level of the executing script. Un- 
privileged code is completely untrusted and may 
be malicious, so we must unconditionally taint all 
script variables created or modified by executing 
scripts originating from untrusted (tainted) docu- 
ments. For privileged scripts, we use standard taint 
propagation rules that mark the output of JavaScript 
instructions as tainted when the instruction inputs 
are tainted. Tainting allows us to mark and track the 
influence of untrusted code throughout the browser. 
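
The two policies can be contrasted on a single instruction. The following is a minimal sketch that models string concatenation only; makeValue, concat, and the scriptIsPrivileged flag are hypothetical names, and the real interpreter stores taint differently (see Section 4.4.1).

```javascript
// Hypothetical tagged value: a payload plus a taint flag.
function makeValue(data, tainted) {
  return { data, tainted };
}

// Model of one interpreter instruction (string concatenation) under the
// two policies described above.
function concat(a, b, scriptIsPrivileged) {
  const tainted = scriptIsPrivileged
    ? a.tainted || b.tainted // privileged script: propagate from the inputs
    : true;                  // unprivileged script: always taint the output
  return makeValue(a.data + b.data, tainted);
}
```

In this model, a privileged script that combines two untainted values keeps an untainted result, whereas the same operation in an unprivileged script always yields a tainted one.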

Tainting systems can suffer from excessive 
false alarms when using control-dependent taint- 
ing. Control-dependent tainting taints the output of 
any code whose execution depends on tainted data. 
For example, all outputs of an if-branch would be 
tainted if the condition variable were tainted. Control
dependence is necessary when the code processing
the tainted data may itself be malicious. For example,
detecting cross-domain information leaks requires
accounting for implicit flows, since malicious
web page scripts could leak information [19]. We do 
not use control-dependent tainting on the privileged 
side because we assume that the privileged scripts 
are trusted. We consider it highly unlikely that priv- 
ileged script code would accidentally launder taints 
through control flow and then execute the laundered 
data as privileged code. 
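
The difference between the two tainting rules can be illustrated with an implicit flow. The sketch below is an assumption-laden model, not the system's code: secret, copyViaDataFlow, and copyViaControlFlow are hypothetical names, and the trackControlDependence flag simply switches the control-dependent rule on or off.

```javascript
// Hypothetical tagged value, as a plain object.
const secret = { data: 1, tainted: true };

// Data-dependent rule: the output is tainted only if an operand is.
function copyViaDataFlow(v) {
  return { data: v.data, tainted: v.tainted };
}

// The branch reads tainted data, but each assigned constant is untainted,
// so data-dependent tainting alone loses the taint (an implicit flow).
function copyViaControlFlow(v, trackControlDependence) {
  const out = { data: 0, tainted: false };
  if (v.data === 1) {
    out.data = 1;
  }
  // Control-dependent rule: the result depends on a tainted condition,
  // so it inherits the taint of that condition.
  if (trackControlDependence) {
    out.tainted = out.tainted || v.tainted;
  }
  return out;
}
```

With the flag off, the copy carries the same value as the secret but no taint, which is the laundering that a malicious script could perform on purpose and that trusted privileged code is assumed not to perform by accident.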

It is necessary to track taint both in the native 
code and inside the script interpreter. For exam- 
ple, when a new HTML document is loaded into a 
tab, privileged UI script code reads the tainted doc- 
ument’s title property and sets it as the caption of 
the tab element. This requires taints from native 
DOM objects associated with the HTML document 
to propagate to script variables in the UI code and 
then back to DOM objects associated with the UI 
document. On the native side, we track the taint status
of string properties of XPCOM objects. Tainting
code in XPConnect taints any JavaScript references
to unprivileged DOM elements and propagates
taints between the XPCOM and JavaScript
environments.


4.3 Attack Detection


We define a privilege escalation attack as tainted 
data executing as privileged code. We implement 
two classes of attack detectors to detect this con- 
dition: compilation detectors and invocation detec- 
tors. Compilation detectors ensure that tainted data 
is never compiled into byte code tagged with priv- 
ileged principals, while invocation detectors moni- 
tor the stack for tainted references to function ob- 
jects creating privileged frames. Compilation de- 
tectors map closely to code compilation vulnerabil- 
ities, while invocation detectors are best suited for 
preventing luring attacks. 


4.3.1 Compilation Detectors 


We use compilation detectors as a proactive mea- 
sure to prevent tainted data from being compiled 
to privileged byte code, even if it is never executed.
These detectors are well suited for securing
eval-like functions that compile strings into byte
code, because the string’s taint status informs these 
functions of the string’s origin. These detectors al- 
low defending against compilation bugs such as the 
wrong principal attack (see Section 3.2.2). If na- 
tive XPCOM code compiles the strings, as in the 
URI code injection attack (see Section 3.2.1), or the 
XSS attacks (see Section 3.2.5), the detectors will 
use the taint status of XPCOM string objects to de- 
tect and prevent exploits. Our compilation detectors 
are placed before all calls to compilation functions, 
such as those defined by the JavaScript API. 
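
A compilation detector of this kind can be sketched as a guard in front of a compile step. The names below (taintedString, cleanString, PrivilegeEscalationError, checkedCompile) are hypothetical and do not correspond to functions in the JavaScript API; the "compilation" is a stand-in that only records its inputs.

```javascript
// Hypothetical tagged string: text plus a taint flag.
function taintedString(text) {
  return { text, tainted: true };
}
function cleanString(text) {
  return { text, tainted: false };
}

class PrivilegeEscalationError extends Error {}

// Guard placed in front of a compile step. `principalIsPrivileged` says
// which principal the resulting byte code would be tagged with.
function checkedCompile(source, principalIsPrivileged) {
  if (source.tainted && principalIsPrivileged) {
    throw new PrivilegeEscalationError(
      "tainted string would be compiled with a privileged principal"
    );
  }
  // Stand-in for real compilation: return a record, do not execute.
  return { code: source.text, privileged: principalIsPrivileged };
}
```

Only the combination of tainted source and privileged principal is rejected; tainted strings may still be compiled with unprivileged principals, which is the normal case for web page scripts.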


4.3.2 Invocation Detectors 


Invocation detectors monitor script execution for 
situations where tainted references to script or na- 
tive functions are invoked inside the interpreter and 
result in the creation of privileged stack frames. 
This policy catches luring attacks in which privi- 
leged scripts are tricked into invoking functions of 
the attacker’s choice. It also detects when an unpriv- 
ileged script uses a reference leak to directly call a 
privileged JavaScript function from an extension.

The invocation detectors vary depending on
whether the invoked functions are scripted or native.
Namespace isolation limits script functions to calling
other script functions within the same namespace.
Therefore, our detectors watch for namespace
pollution, namely callers invoking tainted function 
references that result in a privileged callee stack 
frame, as in the luring attack (see Section 3.2.3). 
This detector is able to intercede before any func- 
tion code is executed with elevated privileges. 

For native functions, it is not as straightforward
to come up with a policy for detecting attacks. It 
can be perfectly safe for privileged scripts to in- 
voke natively defined methods of tainted object ref- 
erences. For example, an extension script could 
call the native toLowerCase string method on a 
web page’s title string. The reference to the title 
string will be tainted, and the function reference to 
the toLowerCase method will also be tainted be- 
cause it is accessed as a method of a tainted string, 
but this operation should not raise a privilege es- 
calation alert because, in and of itself, it does not 
represent a privilege escalation threat even if it is 
called from a privileged context. However, if the 
native function called through the tainted reference 
is a native XPCOM method that is only accessible 
to privileged callers, then a security violation needs 
to be raised as it indicates a luring attack. 

Thus, it is important to know whether the native 
callee is sensitive and whether the caller will be in- 
terpreted as privileged. We get this information by 
letting the call proceed, and if it reaches XPCon- 
nect, the security manager establishes the sensitivity 
of the target XPCOM method or property and per- 
forms a stack inspection to determine the effective 
subject principal of the caller. We augment the se- 
curity manager to signal an attack whenever it com- 
putes a privileged subject principal, but a tainted 
function reference is found on any stack frame dur- 
ing the stack walk. 
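
The augmented stack walk can be sketched as follows. This is a simplified model under stated assumptions: the stack is an array with the innermost frame last, the subject principal is taken from that innermost scripted frame (and is fully privileged when no scripted frames exist), and the names frame, subjectIsPrivileged, and detectLuring are hypothetical.

```javascript
// Each hypothetical frame records the principal it runs with and whether
// the function reference used to enter it was tainted.
function frame(name, privileged, enteredViaTaintedRef) {
  return { name, privileged, enteredViaTaintedRef };
}

// Simplified stack inspection (an assumption of this sketch): the subject
// principal is that of the innermost scripted frame; with no scripted
// frames, only native code is running and the subject is privileged.
function subjectIsPrivileged(stack) {
  if (stack.length === 0) return true;
  return stack[stack.length - 1].privileged;
}

// Augmented check: a privileged subject combined with any tainted function
// reference found during the walk is reported as a luring attack.
function detectLuring(stack) {
  const taintedCall = stack.some((f) => f.enteredViaTaintedRef);
  return subjectIsPrivileged(stack) && taintedCall;
}
```

Because the check only runs when a call reaches the security manager, harmless native calls through tainted references, such as toLowerCase on a page title, never trigger it.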


4.3.3 Reference Leaks 


As demonstrated in Section 5, we can detect and 
stop the vast majority of proof-of-concept exploits 
in the Firefox bug database based on reference 
leaks. We achieve these results by detecting at- 
tempts to directly invoke or lure privileged code 
with our invocation detectors, as in the reference 
leak attack (see Section 3.2.4), and by detecting ma- 
licious attempts to compile tainted strings with our 
compilation detectors. However, we are unable to
detect and prevent reference leaks. For example, 
in Figure 5(a), we cannot rely on the object refer- 
ence’s taint status to detect the privileged reference 
leak, because our tainting rules require that proper- 
ties of tainted objects, such as QueryInterface, also 
be marked tainted. 

Although we do not prevent reference leaks, at- 
tacks employing reference leaks will not be able to 
escape our tainting. Any data modified by untrusted 
scripts is still marked tainted, and invoking or com- 
piling tainted data will trip the detectors. Therefore, 
attackers will not be able to mount a privilege esca- 
lation attack, in which untrusted data is executed as 
privileged code. At most, if the reference leak al- 
lows access to arbitrary global variables in the priv- 
ileged namespace, attackers may be able to devise 
control dependent attacks and compromise the in- 
tegrity of extension logic. 

Barth et al. [11] propose a system for detecting 
reference leaks between different security origins. 
Although their work aims to prevent cross-origin 
attacks made possible by reference leaks, it could 
also be integrated with our system to detect refer- 
ence leaks from privileged namespaces. We should 
note that reference leaks are not a requirement for 
mounting luring attacks. As previously described in 
Section 3.1.2, the target of any luring attack can also
be a call to an eval-like function (such as the Script 
object) or a reference to a sensitive method of an 
XPCOM object legitimately present in the content 
namespace. 


4.3.4 Unsafe XPCOM Arguments 


We are currently conducting a study to determine 
the extent of this class of vulnerability. We plan 
to create a list of sensitive parameters of security- 
sensitive XPCOM interfaces known to the security 
manager to mitigate the threat of tainted XPCOM 
arguments. We would need to provide untainting 
functionality to allow privileged scripts to indicate 
that a tainted argument has been sanitized. Other 
systems, such as Saner [9], allow validating saniti- 
zation routines. 
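
One possible shape for such untainting is sketched below, using the file name example from Section 3.1.4. Everything here is hypothetical: the tainted helper, sanitizeFileName, and saveTo are illustrative names, the sanitizer covers only simple path traversal, and no such interface exists in the system described in this paper.

```javascript
// Hypothetical tagged string: text plus a taint flag.
function tainted(text) {
  return { text, tainted: true };
}

// Sanitizer for a file name taken from a web page: keep only the last
// path component and reject names that are empty or consist only of dots.
function sanitizeFileName(value) {
  const base = value.text.split(/[\\\/]/).pop();
  if (base === "" || /^\.+$/.test(base)) {
    throw new Error("unsafe file name");
  }
  // Untainting step: the privileged caller vouches for the result.
  return { text: base, tainted: false };
}

// Stand-in for a check on a sensitive parameter of a privileged interface.
function saveTo(directory, name) {
  if (name.tainted) {
    throw new Error("tainted argument passed to sensitive parameter");
  }
  return directory + "/" + name.text;
}
```

In this sketch the sensitive call refuses tainted arguments outright, so an extension must pass page-derived names through the sanitizer before they can reach the file system interface.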


4.4 Implementation 


In this section, we describe the implementation of 
our tainting system in the JavaScript interpreter and 
the XPCOM classes and our attack detectors. In our
system, we are most concerned about the taint sta- 
tus of strings and function references because priv- 
ilege escalation attacks require either luring privi- 
leged code or compiling attacker strings. We chose 
not to use an existing system-level tainting solution 
because control dependent tainting is not required 
in our system and low-level tainting systems tend to 
produce a large number of false positives. 


4.4.1 JavaScript Interpreter 


JavaScript tainting requires associating a notion of 
taint with each script variable. JavaScript vari- 
ables can hold the values of primitive data types 
such as booleans and integers, or they can hold 
references to heap allocated data, such as objects, 
strings, and doubles (hereafter collectively referred 
to as “objects”). All accesses to object variables 
are done by reference. We transparently convert all 
tainted primitive variables to doubles (a reference 
type) so that our tainting code exclusively deals with 
reference types. For reasons which we will dis- 
cuss shortly, we do not taint the actual heap object 
pointed to by the reference (e.g., the floating-point
value of a double variable), but instead we only ever 
taint the individual references (pointers). For exam- 
ple, it is possible to have both a tainted and an un- 
tainted reference (pointer) to the same string. There- 
fore, variables of all data types are tainted in the 
same way, i.e., by tainting individual references.

When we implemented our tainting system, we
had a choice between associating taint status with 
objects or with references to objects. We believe 
that it is a mistake to associate taint with objects 
because objects can be safely shared across privi- 
leged and unprivileged namespaces. For example, 
if a string variable were to be defined in a privi- 
leged namespace and then assigned to a variable in 
an unprivileged namespace, and unprivileged code 
were then to copy it into another variable, the origi- 
nal reference and the copy should not have the same 
taint status although they reference the same heap 
object. The value of the copied variable was clearly 
influenced by untrusted code, whereas the original 
variable was not. Note that strings and doubles are 
immutable, so there is no risk of modification by 
untrusted code. In other words, whenever a string 
or a double is modified, a new object is created 
with the new value and the original remains unchanged.
For mutable JavaScript objects, our policy
is to taint individual property references when
they are modified by untrusted code. If we were to 
taint by object instead of by reference, we would 
run the risk of excessive, unnecessary taint propa- 
gation. For example, if an extension stores a tainted 
value in a property of a commonly used object, the 
object itself would become tainted. Therefore, any 
existing fields or methods of the object would also 
become tainted without receiving any tainted data. 
Such tainting could lead to false positives. The most 
egregious example of such unnecessary taint prolif- 
eration occurs when an extension copies a tainted 
variable into its global namespace, which is itself an 
object. Tainting the global object instead of merely 
tainting the property reference would unnecessarily 
taint all existing variables in the trusted extension 
namespace. 
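
The distinction between tainting references and tainting objects can be made concrete with a small model. The sketch assumes a reference is a pair of a heap pointer and its own taint bit; heapString, makeRef, and globals are hypothetical names, and the frozen object merely stands in for an immutable string.

```javascript
// A shared, immutable heap object (here: a frozen string holder).
const heapString = Object.freeze({ chars: "hello" });

// Hypothetical reference: a pointer to a heap object plus its own taint bit.
function makeRef(target, tainted) {
  return { target, tainted };
}

// Defined by privileged code: untainted reference.
const original = makeRef(heapString, false);

// Copied by unprivileged code: same heap object, tainted reference.
const copy = makeRef(original.target, true);

// Per-property tainting on a mutable object: storing a tainted value
// taints only that property reference, not the whole object.
const globals = { title: makeRef(heapString, false) };
globals.pageTitle = copy;
```

Both references point at the same heap object, yet only the copy made by untrusted code is tainted, and adding it to the globals object leaves the existing title property untainted.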

Therefore, we implemented variable tainting by 
storing a taint bit inside each variable. Internally, 
each JavaScript variable is a machine word with a few
of the least significant bits reserved for a type tag 
used for dynamic typing. We set aside an extra bit 
in the type tag for the taint status. The upper bits 
of primitive variables contain the variable’s value, 
while the upper bits of references contain a pointer 
to a memory-aligned heap object. A downside of 
our reference tainting approach is increased mem- 
ory use because heap objects now have to align at 
bigger boundaries. Specifically, we can store half 
as many JavaScript objects within a single memory 
page. This may seem like a large overhead for our 
approach, but the heap-allocated data structures are 
very small because the data structures use unaligned 
pointers to point to their actual contents. For ex- 
ample, the aligned, heap-allocated string data struc- 
ture consists of two member variables: the string 
length and a pointer to an unaligned character array 
stored elsewhere on the heap. In practice, we find 
the overhead is not significant because JavaScript 
heap memory accounts for only a small portion of 
the Firefox memory footprint. Empirical measure- 
ments confirm that the increase in Firefox’s data res- 
ident set size is less than 10% in everyday browsing, 
even on JavaScript-heavy sites such as GMail. 
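
The bit manipulation involved can be sketched on 32-bit words. The layout below is an assumption chosen for illustration (three type-tag bits followed by one taint bit); the actual number of tag bits and their positions in the interpreter are not stated in this section, and makeRef, isTainted, tagOf, and addressOf are hypothetical helpers.

```javascript
// Assumed layout (illustrative only): bits 0-2 hold the type tag,
// bit 3 holds the taint flag, and the remaining bits hold the payload.
const TAG_BITS = 3;
const TAG_MASK = (1 << TAG_BITS) - 1; // 0b0111
const TAINT_BIT = 1 << TAG_BITS;      // 0b1000
const LOW_BITS = TAG_BITS + 1;        // low bits unavailable to pointers

// Pack an aligned address, a type tag, and a taint flag into one word.
function makeRef(address, tag, tainted) {
  const alignment = 1 << LOW_BITS;
  if (address % alignment !== 0) {
    throw new Error("address must be aligned to " + alignment + " bytes");
  }
  return (address | (tag & TAG_MASK) | (tainted ? TAINT_BIT : 0)) >>> 0;
}

const isTainted = (word) => (word & TAINT_BIT) !== 0;
const tagOf = (word) => word & TAG_MASK;
const addressOf = (word) => (word & ~(TAG_MASK | TAINT_BIT)) >>> 0;
```

Under this assumed layout, three tag bits require 8-byte alignment while the extra taint bit requires 16-byte alignment, which is consistent with half as many minimum-size objects fitting in a memory page.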

We added code to propagate taint between the 
inputs and outputs of each of the 154 opcodes in 
the JavaScript interpreter as well as code to unconditionally
taint all outputs produced by unprivileged
scripts. In addition to the aforementioned
data types, scripts can also make use of a num- 
ber of built-in objects and top-level properties and 
functions defined by the JavaScript language. Some 
built-in objects provide more advanced data types 
such as the “Date” and “Array” objects, while 
other built-ins provide utility functionality such as 
the “Math” object and the “encodeURI” function. 
Instead of painstakingly modifying each of these 
methods and functions individually to propagate 
taints, we conservatively taint the return values from 
any built-in function or method if any supplied arguments
are tainted. For example, the returned
values from Math.sqrt(X) or encodeURI(X)
will be tainted if X is tainted. Finally, we had to 
make a few manual changes in the interpreter code 
to prevent loss of taint. For example, object refer- 
ences were sometimes converted into raw pointers 
and then the same raw pointers were converted back 
into object references without restoring the taint bit 
in the type tag. 
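
The conservative rule for built-ins amounts to a generic wrapper. The sketch below models it outside the interpreter with hypothetical names (val, taintAware); it wraps the standard Math.sqrt and encodeURI functions rather than modifying them.

```javascript
// Hypothetical tagged value: payload plus taint flag.
const val = (data, tainted = false) => ({ data, tainted });

// Wrap a built-in so that its result is tainted whenever any argument is.
function taintAware(builtin) {
  return (...args) => {
    const anyTainted = args.some((a) => a.tainted);
    const result = builtin(...args.map((a) => a.data));
    return val(result, anyTainted);
  };
}

const sqrt = taintAware(Math.sqrt);
const encode = taintAware(encodeURI);
```

The wrapper never inspects what the built-in does with its arguments, which is what makes the rule conservative: a result is tainted even if the tainted argument had no influence on it.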


4.4.2 XPCOM 


We track the taint status of string objects in the XP- 
COM code because it is possible for native and in- 
terpreter code to compile strings into attack code. 
We also pay special attention to tracking taint in 
DOM string properties as these properties are the 
initial taint source and a very common taint sink. 

We have borrowed the XPCOM string-tainting 
implementation from Vogt et al. [19]. This imple- 
mentation adds taint flags to XPCOM string classes 
and modifies string class methods to preserve taint. 
We extended it to more string classes and made a 
small number of manual changes to account for the 
taint laundering that occurs in the code base when 
raw string pointers are extracted from string objects 
and used to create new string objects. 

The XPCOM implementations of markup ele- 
ments, representing the contents of the browser UI 
and web pages, do not store all their string prop- 
erties within XPCOM string classes. The string 
properties of these DOM elements are a significant 
source and propagation vector for tainted data, so 
we needed to associate each string property of a 
DOM element with a taint status. To this end, we 
modified a small number of base classes from which
DOM elements of all types are derived. DOM 
classes redirect calls to get or set individual prop- 
erties to a handful of methods in these base classes, 
allowing us to add taint-propagation behavior and 
to automatically taint string properties of elements 
in unprivileged documents. 

Adding taint tracking for every type of XPCOM 
property is difficult because there is no elegant way 
to associate taint status with primitive data types in 
the native XPCOM code. However, it is straight- 
forward to taint all script references to unprivi- 
leged DOM objects. We added a taint bit to the 
“wrappers” used to reflect XPCOM objects into 
the JavaScript environment as well as the wrap- 
pers used to reflect JavaScript objects into XPCOM 
code. The first time XPConnect is asked to reflect a
given object between the two environments, it cre- 
ates a new wrapper object in the destination envi- 
ronment. For wrappers around XPCOM objects, we 
alter the wrapper creation process to check whether 
the wrapped object is a DOM node and, if so, whether it
belongs to an unprivileged document. When the 
wrapper is placed in a JavaScript namespace, we 
make sure its object reference is tainted. The taint- 
ing rules in the interpreter automatically taint the 
values obtained from reading tainted objects’ prop- 
erties, effectively tainting all string and non-string 
properties of unprivileged DOM elements. Simi- 
larly, when a JavaScript object or function reference 
is wrapped for the XPCOM environment (e.g., a 
JavaScript callback function), we make sure its taint 
status is preserved and therefore propagated during
a property read or a function call. 
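
The wrapper rule for native objects can be sketched in a few lines. This is a toy model, assuming that a DOM node is any object with an ownerDocument field and that a document carries a privileged flag; wrapNative and readProperty are hypothetical names and do not reflect the XPConnect interfaces.

```javascript
// Hypothetical native object model: a DOM node knows its owner document.
const webDoc = { privileged: false };
const uiDoc = { privileged: true };

// Reflect a native object into script as a reference with a taint bit.
// The reference is tainted only for DOM nodes of unprivileged documents.
function wrapNative(native) {
  const isDomNode = native.ownerDocument !== undefined;
  const tainted = isDomNode && !native.ownerDocument.privileged;
  return { native, tainted };
}

// Reading a property through a tainted reference yields a tainted value.
function readProperty(ref, name) {
  return { data: ref.native[name], tainted: ref.tainted };
}
```

Tainting the reference once at wrapper creation is enough in this model, because every later property read inherits the taint without any per-property bookkeeping on the native side.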


4.4.3 Attack Detectors 


Once we determined the detection policies described
in Sections 4.3.1 and 4.3.2, implementation
of the attack detectors became straightforward.
The compilation detector code was added to the na- 
tive functions that turn strings into bytecode (such 
as “eval”), while the invocation detector code was 
added to the code that implements JavaScript func- 
tion calls. The only challenge was in finding the 
appropriate sites to install the detectors so that 
all JavaScript compilation and function invocations 
could be audited. The detectors had to be close 
enough to the low-level compilation and invocation 
code to intercept all the relevant call paths, but at the 
same time sufficiently high-level to easily retrieve 
principals and taint status. 
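
As a rough illustration of the two detectors, the following Python sketch reduces each one to the check it performs at its hook site. The function names, the boolean parameters and the SecurityAlarm exception are assumptions made for this sketch; the policies are paraphrased from the alarm descriptions in Section 5.2, not taken from the Firefox patch.

```python
class SecurityAlarm(Exception):
    """Raised when a detector flags a suspected privilege escalation."""


def check_compile(source_tainted: bool, principal_privileged: bool) -> None:
    """Compilation detector: hook placed where strings become bytecode
    (e.g., an eval-like native). Flags tainted source compiled with
    privileged principals."""
    if source_tainted and principal_privileged:
        raise SecurityAlarm("tainted string compiled in privileged context")


def check_invoke(caller_privileged: bool, callee_privileged: bool,
                 reference_tainted: bool) -> None:
    """Invocation detector: hook placed in the function-call path.
    Flags privileged code calling a privileged function through a
    tainted reference (a luring attack)."""
    if caller_privileged and callee_privileged and reference_tainted:
        raise SecurityAlarm("privileged call through tainted reference")
```

Both checks need the principals and the taint status at the hook site, which is why the placement trade-off described above matters.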


5 Evaluation 


We have implemented the approach described above 
in the Firefox browser. In this section, we eval- 
uate our system by demonstrating its effectiveness 
against privilege escalation attacks. We start by 
showing how well it prevents attacks on known Fire- 
fox vulnerabilities. These vulnerabilities are docu- 
mented in Firefox’s Bugzilla bug database, which 
provides detailed security reports, proof-of-concept 
exploits and any available bug fixes. Next, we show 
that our system has minimal impact on normal us- 
age by evaluating any false alarms that are raised 
and the performance overhead. 

We evaluated against proof-of-concept attacks 
from Mozilla’s bug database because the vulnerabil- 
ities are well cataloged and the proof of concept at- 
tacks are readily available. Most extension authors 
do not invest as much effort as Mozilla into docu- 
menting security issues in their code, thus making 
it difficult to evaluate our system against attacks on 
specific extensions. However, the same vulnerabili- 
ties could be leveraged against extensions. 

We have implemented our system on Firefox ver- 
sion 1.0.0, which we use for all the experiments. We 
chose this version because it has the largest number 
of known privilege escalation bugs, allowing more 
extensive testing of our system. Also, the Firefox 
security team has a policy of embargoing reports 
for recent vulnerabilities, except for exploits already 
available in the wild. As a result, recent versions of 
Firefox have far fewer available privilege escalation 
exploits. For example, as of the end of 2009, the 
current version of Firefox (v3.5) has several privilege 
escalation vulnerabilities, as shown below, but 
no publicly available exploits for them. We plan to 
port our system and evaluate our results for newer 
versions of Firefox as exploits become available in 
the bug database. 


5.1 Vulnerability Coverage 


Table 1 shows the continuing threat posed by priv- 
ilege escalation (PE) vulnerabilities in the Firefox 
browser. This table shows the total number of crit- 
ical vulnerabilities and the number of critical PE 
Table 1: Vulnerability Statistics. 


vulnerabilities in the various major versions of the 
browser. The last column shows the percentage 
of PE vulnerabilities. Most PE vulnerabilities are 
generally classified as critical, and thus we do not 
show the statistics for non-critical vulnerabilities. 
Table 1 shows that PE vulnerabilities comprise 2/3 
of all critical Firefox 1.0 vulnerabilities. All other 
versions consistently have about 1/3 PE vulnerabilities. 
The main reason for the drop is that Firefox 1.5 implements 
safety wrappers that limit the opportunities for un- 
safe interactions between privileged code and web 
content, as described in Section 3.2.4. 

Table 2 shows all the 19 privilege escalation advi- 
sories affecting Firefox 1.0.0, with some advisories 
containing multiple bug reports. Note that there are 
26 such advisories in Firefox 1.0 (of which 18 are 
critical as shown in Table 1), but the other seven do 
not run on Firefox 1.0.0 and so we are unable to re- 
produce them. We were unable to test our system 
against 5 out of the 19 advisories because exploits 
were not available for them. The last column shows 
the types of vulnerabilities exploited in each advi- 
sory. For reference leaks, we also show whether 
the leak is leveraged to compile code (C) with the 
wrong principals or execute a luring attack (L). 

Our system guards against 13 out of the 14 vul- 
nerabilities described in the advisories. We do not 
detect an attack on the vulnerability in advisory #6. 
In this attack, an untrusted HTML string is parsed 
by the HTML parser to generate new HTML ele- 
ments in a privileged document. Currently, we lose 
taint because we have not implemented taint propa- 
gation within the HTML parser. 


5.2 False Positive Evaluation 


We also tested our system by installing the top 
10 most popular extensions that were available for 
Firefox 1.0.0, and then we manually browsed the 
Web. These extensions are Adblock Plus, Foxy- 
Tunes, NoScript, Forecastfox, Add N Edit Cookies, 
PDF Download, StumbleUpon, 1-Click Weather, 
#  | Advisory | Name | Type of Vulnerability 

   | 2006-1 |  | Leak (C) 
   |  |  | Leak (C) 
   |  |  | Leak (C) 
   |  |  | Leak (C) Leak (L) 
6  | 2005-49 | Script injection from Firefox sidebar panel using data: | Compilation 
8  | 2005-43 | “Wrapped” javascript: URLs bypass security checks | Compilation 
14 | 2005-12 | javascript: Livefeed bookmarks can steal private data | Compilation 

Embargoed, or exploit not available 

15 | 2006-24 | Privilege escalation using crypto.generateCRMFRequest | N/A 
16 | 2006-05 | Localstore.rdf XML injection through XULDocument.persist() | N/A 
17 | 2005-58 | Firefox 1.0.7 / Mozilla Suite 1.7.12 Vulnerability Fixes | N/A 
18 | 2005-45 | Content-generated event vulnerabilities | 
19 | 2005-27 | Plugins can be used to load privileged content | 


Table 2: Vulnerability Coverage. 


MR Tech Toolkit and FLST. A user, who is not as- 
sociated with the project, browsed the Web for 5 
hours, specifically visiting the top 100 most heav- 
ily visited web sites, as ranked by Alexa [2]. The 
user interacted extensively both with the web sites 
as well as with the extensions (e.g., directly invok- 
ing extension functionality by setting preferences). 
The user’s testing caused one alarm. This 
alarm was caused by Forecastfox, which dis- 
plays the current weather forecast for a city of 
the user’s choice. When a user searches for 
his city while setting his preferences, Forecastfox 
queries accuweather.com for cities matching 
the search string. When the user selects his city 
from the search results, Forecastfox concatenates 
several strings together including the full city name 
fetched from the web site and eval’s this expres- 
sion to set the city option. Since the city name 
string originates from an untrusted web page, and 
the expression is evaluated in a privileged context, 
the alarm is raised. This code is unsafe because if 
the web site were compromised, the browsers of all 
Forecastfox users could be exploited. After seeing 
this alarm, we researched and found that Forecast- 
fox for Firefox 3.0 has removed the eval state- 
ment. 
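
The Forecastfox alarm hinges on taint surviving string concatenation. The Python sketch below shows one way such propagation can be modeled; TaintedStr, is_tainted and build_option_expr are hypothetical names, the setOption expression is invented for illustration, and only concatenation is covered (other string operations would need the same treatment).

```python
class TaintedStr(str):
    """A string marked as derived from untrusted content (hypothetical)."""

    def __add__(self, other):
        # tainted + anything -> tainted
        return TaintedStr(str.__add__(self, other))

    def __radd__(self, other):
        # plain + tainted -> tainted
        return TaintedStr(str.__add__(other, self))


def is_tainted(s: str) -> bool:
    return isinstance(s, TaintedStr)


def build_option_expr(city: str) -> str:
    """Mimics the pattern in the text: build code by concatenation."""
    return "setOption('city', '" + city + "')"
```

If the city name comes from a web page as a TaintedStr, the assembled expression is tainted as well, so a check at the point where a privileged script evaluates it has what it needs to raise the alarm. Passing the city name as data rather than splicing it into code avoids the problem entirely.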

We also performed automated testing by writing 
a Web crawler extension for Firefox. The crawler 
extension takes as input a list of web sites to visit 
and directs Firefox to load any HTML or JavaScript 
links found in the web site in depth-first order and 
interacts with each loaded page in Firefox to mimic 
the behavior of a human user. On each page, the 
crawler chooses multiple events to send to the page 
(e.g., mouse clicks, keystrokes) and fills out and 
submits any HTML forms. The crawler exercises 
the JavaScript in the browser UI by performing one 
of several scripted GUI actions such as viewing the 
web page’s HTML source code. We also installed 
AdBlock and Flashblock extensions and had the 
crawler randomly add and remove AdBlock filters 
on each page visited. The full crawler test visited 
100 pages from each website in the Alexa Top 200. 

The automated testing resulted in the discov- 
ery of one false positive, triggered by selecting 
“Page Source” from Firefox’s “View” menu. The 
offending UI JavaScript retrieves a (tainted) refer- 
ence to a window object from the content names- 
pace. The window object implements multiple in- 
terfaces and some of these are sensitive interfaces 
inaccessible to web page scripts. The UI script casts 
the reference to the window object to a sensitive 
interface, further propagating the taint. When the 
privileged code calls a sensitive method of this in- 
terface through the tainted reference, our detectors 
flag it as a luring attack. This is not likely an ex- 
ploitable vulnerability, but it would be safer if priv- 
ileged JavaScript obtained references to sensitive 
interfaces without going through a content names- 
pace. 

While our testing is limited to heavily visited web 
sites, we believe that our system will not generate 
many false positives with other web sites. First, we 
find that privileged scripts are careful when operating 
on untrusted data and are selective about the strings 
they compile in their privileged context (i.e., 
compilation false positives). Second, namespace isolation 
works well enough in non-malicious environments that it 
is difficult for privileged function references to 
become tainted (i.e., luring false positives). Similarly, 
web pages don’t expect to have access to privileged 
references and thus are unlikely to access them 
legitimately (i.e., reference leak false positives). 


5.3 Performance 


During regular browsing, we did not notice any 
degradation in page load times or responsiveness. 
We also conducted experiments to quantify the per- 
formance overhead of our system. We ran the Dro- 
maeo JavaScript Tests and the DOM Core Tests 
from Mozilla’s performance test suite [3]. These 
tests are micro-benchmarks that measure 1) the per- 
formance of basic operations of the script inter- 
preter, and 2) the performance of common DOM op- 
erations. Our experiments were run on Ubuntu 8.04 
Linux on an Intel Core 2 Duo 2.4 GHz processor, 
with 2 GB of memory. Our browser had 28% over- 
head for the JavaScript tests and 32% overhead for 
the DOM tests. Although the overhead witnessed in 
these micro-benchmarks does not visibly influence 
the browsing experience, the overhead may become 
an impediment to the adoption of our system at a 
time when JavaScript performance is becoming a 
competitive feature for modern browsers. One pos- 
sible research direction would be to investigate how 
to efficiently integrate our tainting system with the 
just-in-time compilation systems present in modern 
JavaScript engines. 


5.4 Security Analysis 


Our system effectively detects nearly all available 
proof-of-concept attacks with few false positives. 
Admittedly, these proof-of-concept attacks were not 
designed with our detection system in mind. In or- 
der to defeat our defenses, an attacker would need 
to find a means of removing taint from untrusted 
objects. It would be difficult to remove taint in 
the JavaScript interpreter as the tainting rules are 
straightforward. The most likely target for launder- 
ing taint would be the native XPCOM methods. 


One possible way for the browser to lose taint 
is to store tainted objects outside the browser. For 
example, if a user saves a malicious URL string 
from a web page as a bookmark, the bookmark is 
stored in a bookmarks file and the URI’s taint is 
no longer present when the browser is restarted. A 
second, more involved method may be to launder 
taint through XPCOM method arguments. The at- 
tack begins by tricking an extension into passing a 
tainted, privileged object (a luring target) to an XP- 
COM function. If this function then natively calls 
a privileged native method of the tainted argument, 
our system would not detect this as a luring attack. 
This is because the extension JavaScript did not di- 
rectly invoke a privileged method through a tainted 
reference. Similarly, if an XPCOM function were to 
accept a tainted object as an argument but then re- 
turn a different, but related untainted object, it may 
be accurate to say the taint was laundered. Note that 
in these examples, the arguments and return values 
could not be strings as taint is always propagated 
during XPCOM string operations. 


Although laundering taint is theoretically possi- 
ble within our system, our system greatly raises 
the bar for potential attackers. The attackers now 
not only need to find a privilege escalation vul- 
nerability in the browser, they also require exten- 
sion JavaScript that interacts with specific XPCOM 
methods in such a way as to launder taint from cru- 
cial attack variables. 
6 Related Work 


This work focuses on securely executing untrusted 
scripts by using taint-based stack inspection. Stack 
inspection is widely used by modern component- 
based systems, such as Java and Microsoft .NET 
Common Language Runtime, to ensure that remote 
code is sufficiently authorized to perform a security- 
sensitive operation. Wallach et al. [20] provide in- 
structive background on stack inspection. 


Taint analysis helps determine whether untrusted 
data may influence data that is trusted by the sys- 
tem. Newsome and Song [16] use dynamic taint 
analysis to taint data originating or derived from 
untrusted network sources. An attack is detected 
when tainted data is used in a dangerous way, such 
as overwriting a return address. We use a similar 
approach to ensure that dirty data is not executed in 
a trusted context. Vogt et al. [19] use script tainting 
in a browser to track sensitive browser data, such as 
browser cookies or the URLs of visited pages. 


The same origin policy is the basic sandboxing 
method used by web browsers. An effective method 
for implementing the same origin policy is script 
accenting [12], which uses simple XOR encryption 
to ensure that code is loaded and run, and data is 
created and accessed, by the same principal. Sev- 
eral recent projects [22, 17] attempt to enforce the 
same origin policy by separating different origins 
into different processes. In order to adopt this archi- 
tecture, the extension model needs to be redesigned 
to accommodate extensions’ interactions with pages 
from different principals [10]. The same origin pol- 
icy is too strict for mashup web applications. For 
such applications, Mashup OS provides abstractions 
to allow limited communication while protecting 
the different principals associated with mashup con- 
tent [21]. Interestingly, Mashup OS introduces the 
same set of problems as privileged extensions inter- 
acting with untrusted content and thus would benefit 
from our solution. 


In concurrent work, Barth et al. [10] propose a 
new browser extension model for Google Chrome. 
Extensions and web page scripts are isolated us- 
ing processes and “isolated worlds” so that they 
never exchange JavaScript pointers. This architec- 
ture raises the bar for perpetrating a successful priv- 
ilege escalation attack as multiple components now 
need to be compromised. Their design has obvious 
advantages, but the threat of privilege escalation at- 
tacks has not been completely eliminated. For ex- 
ample, Google recently fixed a vulnerability that in- 
correctly allowed JavaScript to be executed in the 
context of a Chrome extension [7]. 

Since browser extensions typically run with unre- 
stricted privileges, a malicious extension can serve 
as a powerful attack vector. Louw et al. [15] pro- 
pose access control for limiting extension privi- 
leges. For example, certain extensions may not be 
allowed access to the password manager. Dhawan 
and Ganapathy [13] propose adding an information- 
flow tracking system to Firefox to assist in deter- 
mining whether a JavaScript extension maliciously 
compromises browser confidentiality or integrity. 
Although we are also interested in misuses of low- 
integrity data, their system is not an online attack 
detector and it requires human analysis. 

Recent versions of Firefox use security wrap- 
pers (e.g., XPCNativeWrappers, XPCChromeOb- 
jectWrappers, etc.) to regulate interactions be- 
tween JavaScript and XPCOM objects from differ- 
ent namespaces [8]. Unfortunately, implementa- 
tion bugs in creating and manipulating wrappers are 
fairly common. Our system adds another layer of 
security on top of wrapper techniques by effectively 
second-guessing wrapper security decisions. 


7 Conclusion 


Script-based privilege escalation attacks are a se- 
rious and recurring threat for extensible browsers 
such as Firefox. In this paper, we describe the 
pitfalls of script-based extensibility in Firefox and 
show that the privilege escalation vulnerabilities are 
similar to arbitrary code injection and execution 
control vulnerabilities found in unsafe code. Then, 
we propose a tainting-based system that specifically 
targets each class of vulnerability. We implemented 
such a system for the Firefox 1.0 browser and our 
evaluation shows that it detects the vast majority of 
attacks in the Firefox bug database with almost no 
false alarms and moderate overhead. 

Our vulnerability classification and our proposed 
defense system are inevitably linked to the Fire- 
fox browser. However, one-way namespace isola- 
tion must exist in browser extension architectures 
because extensions need access to restricted APIs 
and they also need to read and modify untrusted 
web pages. As such, we expect our analysis and 
results to be applicable to other script-extensible 
browsers. We plan to test the generality of our 
vulnerability classification and defenses against other 
browsers, especially Google Chrome as it also pro- 
vides powerful script extension functionality. 
Abstract 


Web publishers frequently integrate third-party adver- 
tisements into web pages that also contain sensitive pub- 
lisher data and end-user personal data. This practice ex- 
poses sensitive page content to confidentiality and in- 
tegrity attacks launched by advertisements. In this pa- 
per, we propose a novel framework for addressing security 
threats posed by third-party advertisements. The heart of 
our framework is an innovative isolation mechanism that 
enables publishers to transparently interpose between ad- 
vertisements and end users. The mechanism supports fine- 
grained policy specification and enforcement, and does 
not affect the user experience of interactive ads. Evalua- 
tion of our framework suggests compatibility with several 
mainstream ad networks, security from many threats from 
advertisements and acceptable performance overheads. 


1 Introduction 


On September 13, 2009, readers of the New York Times 
home web page were greeted by an animated image of a 
fake virus scan. Amidst widespread confusion, NY Times 
clarified the situation in an article [48], explaining the 
source of the rogue anti-virus attack was one of its adver- 
tising partners. Just two months prior, members of social 
web site Facebook were presented with advertisements 
(henceforth, “ads”) deceptively portraying private images 
of their family and friends [38]. Facebook responded in an 
article [42] blaming advertisers for violating policy terms 
governing the use of personal images. 

Publishers of online ads (like the NY Times and Face- 
book) face two serious challenges. They must ensure ads 
will neither violate the integrity of publisher web pages 
(as occurred with NY Times), nor breach confidentiality 
of user data present on publisher web pages (as occurred 
with Facebook). Ads are often tightly integrated into pub- 
lisher web pages, and therefore must coexist with high in- 
tegrity content and sensitive information. Typically, ad 
content is dynamically fetched from ad networks (e.g., 
Google AdSense) by the user’s browser, leaving little op- 
portunity for publishers to inspect and approve ads before 
the ads are rendered. 

Online advertising is currently a lucrative market, ex- 
pected to hit the US$50 billion mark in the U.S. dur- 
ing 2011 [52]. For many publishers, online advertising 
is an economic necessity. However, publishers have few 
resources enabling them to enforce integrity and confiden- 
tiality policies on ads. One common approach is for ad 
networks to screen each ad for potential attacks. This pas- 
sive approach simply shifts the burden of protection from 
publisher to ad network. To enforce compliance, publish- 
ers must use out-of-band mechanisms (e.g., legal agree- 
ments), which leave the publisher vulnerable to any gaps 
in the ad network’s screening strategy. Rogue ads may 
slip through and cause damage, as in the high-profile 
examples above. 

Due to the dangers of rogue ads, publishers are in 
great need of an active, technological approach to protect 
themselves and their end users. Therefore, in this paper 
we confront the problem of rogue ads from a publisher- 
centric perspective. At a basic level, a publisher is a 
web application that includes dynamically sourced con- 
tent from an ad network in its output. Our objective is 
to empower this web application to serve ads from main- 
stream ad networks, while protecting its end users from 
several threats posed by rogue ads. 


1.1 Contributions 


In this paper, we present ADJAIL, a framework that helps 
web applications render ads from mainstream ad networks 
without compromising publisher security. Our framework 
achieves this protection by applying policy-based 
constraints on ad content. There are five 
significant contributions of our approach: 


1. Confidentiality and integrity policy specification and 
enforcement. We define a simple and intuitive policy 
specification language for publishers to specify several 
confidentiality and integrity policies on advertisements 
at a fine-grained level. We provide a novel and conceptually 
simple policy enforcement mechanism that 
offers principled security guarantees. 


19th USENIX Security Symposium 


2. Compatibility with ad network targeting algorithms. 
Ad networks use targeting algorithms to select which 
ads to display, based on several factors such as page 
context and user behavior. In many cases, these al- 
gorithms are implemented as scripts that analyze pub- 
lisher content to select and fetch appropriate ads to 
be displayed. Our approach supports these targeting 
scripts, with the added benefit of restricting the target- 
ing script’s access to sensitive data. 


3. Compatibility with ad network billing operations. Ad 
networks employ complex billing strategies based on 
several metrics, including ad impressions (number of 
times an ad is shown) and mouse clicks. Furthermore, 
ad networks have mechanisms for dealing with click 
fraud [2]. To remain transparent to billing and click- 
fraud detection mechanisms, our approach preserves 
impression and click metrics. 


4. Consistency in user experience. Our approach does not 
affect the user experience in interacting with ads, for 
any change in the user experience (in terms of content, 
position and interactivity) may reduce the effectiveness 
of advertising. Furthermore, ADJAIL highlights the se- 
curity trade-offs that are required for ensuring consis- 
tency in user experience for certain types of ads (such 
as inline text ads). 


5. Satisfaction of practical deployment requirements. 
Publishers should not have to expend significant labor 
in adopting a new framework, as this may make adop- 
tion prohibitively expensive. Furthermore, publishers 
should be able to deploy a solution that does not require 
end users to install new client software (e.g., browsers, 
plug-ins, etc.) or make changes to their existing client 
software. Therefore, we offer a practical solution that 
is easy to adopt, and works on mainstream browsers in 
their default settings, without any modifications. 


1.2 Overview 


The crux of our approach is a novel policy enforcement 
strategy that can be employed by the publisher to interpose 
itself transparently between the ad network and end user. 
The enforcement strategy starts by fetching and execut- 
ing ads in a hidden “sandbox” environment in the user’s 
browser, thus shielding the end user and web application 
from many harmful effects. 

In order to preserve the user experience, all ad user in- 
terface elements are then extracted from the sandbox and 
communicated back to the original page environment, as 
permitted by the publisher’s policy. This step enables the 
user to see and interact with the ad as if no interposition 
happened. All user actions are communicated back to the 
sandbox, thus completing a two-way message conduit for 
synchronization. Our approach ensures transparency with 
regard to the number of ad clicks and impressions by inter- 
posing on the browser’s Document Object Model to sup- 
press extraneous HTTP requests. 
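
The two-way conduit can be pictured with a small Python model. This is a conceptual sketch only: the Page class, the element identifiers and both functions are assumptions introduced here, and the actual mechanism runs in the browser between two page environments rather than between Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A toy page: element id -> content, plus a log of received events."""
    elements: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


def mirror_ad_ui(sandbox: Page, real: Page, allowed_ids: set) -> None:
    """Sandbox -> real page: copy only the ad elements the policy permits."""
    for elem_id, content in sandbox.elements.items():
        if elem_id in allowed_ids:
            real.elements[elem_id] = content


def forward_user_event(real: Page, sandbox: Page, elem_id: str, event: str) -> None:
    """Real page -> sandbox: relay a user action on a mirrored ad element."""
    if elem_id in sandbox.elements and elem_id in real.elements:
        sandbox.events.append((elem_id, event))
```

The ad script only ever sees the sandbox page. Whatever it renders there reaches the user only if the publisher's policy admits it, and user actions flow back so that the ad behaves as if it were running in the original page.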

We have built a prototype implementation of AD- 
JAIL that supports specification and enforcement of fine- 
grained policies on ads sourced from leading ad networks. 
The prototype is designed to be compatible with several 
mainstream browsers including Google Chrome, Firefox, 
Internet Explorer (IE), Safari and Opera. One minor lim- 
itation of our implementation (but not of our architecture) 
is that it is not compatible with IE 7.x or below. However, 
the current ADJAIL prototype is compatible with IE 8.0 
and later. 

We evaluate ADJAIL on the dimensions of ad network 
compatibility, security, and performance overheads. Our 
compatibility evaluation tested ads from six mainstream 
ad networks. We find that ADJAIL provides excellent 
compatibility for most ads. We also demonstrate the 
strong protection offered by ADJAIL from many signifi- 
cant threats posed by online ads. In our experiments, the 
currently unoptimized ADJAIL prototype encountered at 
most a 1.69 slowdown in rendering ads. 

The remainder of this paper is organized as follows: 
Section 2 provides the threat model, scope and related 
work. We provide the architecture and the main ideas be- 
hind ADJAIL in Section 3. Section 4 discusses the details 
in the implementation of ADJAIL. Our security, compati- 
bility and performance evaluation appears in Section 5. In 
Section 6 we conclude. 


2 Threat Model and Related Work 


2.1 Threat model 


Consider a publisher who wishes to carry ads on a web- 
mail (Web-based email) application. We will use this as 
a running example throughout the paper to illustrate the 
various aspects of our framework. A screenshot from an 
actual webmail application we used in our evaluation ap- 
pears in Figure 1. The top pane of the window presents the 
message list and the bottom pane presents the email mes- 
sage text. Four numbered advertisements also appear in 
the figure: (1) a banner ad that appears on top of the web- 
mail page, (2) a skyscraper ad that appears as a sidebar, 
(3) an inline text ad that appears when the user’s mouse 
hovers over an underlined word, and (4) a floating ad that 
overlays the image of a clock on the page. 

These ads highlight two interesting challenges we need 
to overcome. First, the sidebar ad requires access to the 
email message text, which it mines to ascertain page context 
and select relevant ads for display (i.e., contextual 
targeting). The inline text ad also requires access to the 
message for contextual targeting and to integrate ads among 
the text. However, supporting these ads by providing access 
to the entire message carries the risk of exposing private 
content (e.g., email addresses) to the ad script. Second, 
the floating ad requires access to the real estate of the 
page to place the image of the clock over the message text. 
However, providing access to the page real estate enables 
an ad to overlay content over the entire page, which may 
interfere with trusted interface components. 

Figure 1: Samples of various ad types. A webmail application with (#1) banner and (#2) skyscraper ads. Also illustrated are (#3) an inline text ad and (#4) a floating ad. 

These common examples illustrate how ads require 
non-trivial access to publisher content and the screen, and 
will not work if such access is denied. Also, in all of the 
examples above, the ad content is loaded and rendered by 
a third-party ad script (an ad script example appears in 
Figure 4a). Ad scripts are given full page access by de- 
fault, and thus pose threats to the confidentiality and in- 
tegrity of page content. Our goal is to support the non- 
trivial access required by these and many other typical 
forms of ads, while addressing the security concerns of 
executing third-party ad scripts. 


2.2 Threat scope 


Web applications that display third-party content on client 
browsers are exposed to a wide variety of threats. It is 
therefore important to clarify our threat model, specifi- 
cally on the nature of protections that we offer and the 
threats that are outside the scope of this work. 


In-scope threats The broad threats that we address in this 
work are those targeted by recent efforts in the Web stan- 
dards community for content restrictions (e.g., Content 
Security Policy [32, 43]). These policies are specified by 
a website to restrict the capabilities of third-party scripts, 
specifically with reference to access and modification of 
first-party (site owned) content, as well as control over the 
screen. Policies can be negotiated between a publisher 
and its customers, or directly reflect the site security and 
privacy practices. 

Our framework provides a means for specification and 
enforcement of such policies. For instance, in our web- 
mail example, an integrity policy can be enforced such 
that email message content cannot be tampered with, but 
can still be read (for contextual targeting of ads). Publish- 
ers may also choose to restrict where ads can appear on 
the page. 

Publishers can also use our framework to enforce poli- 
cies about confidentiality of content. For instance, a pub- 
lisher can enforce a policy that mail headers and email 
“address books” (containing private email addresses) can- 
not be read by ads. For the Facebook attack in §1, a policy 
specifying confidentiality of user images, combined with 
our enforcement mechanism, would have prevented the 
attack. 
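
To make the webmail policies concrete, here is a hypothetical Python encoding of per-region read and write permissions with a default-deny check. The region names, the RegionPolicy type and the ad_may function are illustrative assumptions and do not reproduce the framework's own policy language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionPolicy:
    """Hypothetical per-region permissions granted to ads."""
    read: bool = False    # may the ad read this content?
    write: bool = False   # may the ad modify this content?


# Webmail example from the text: message text is readable (for
# contextual targeting) but not writable; mail headers and the address
# book are neither. "ad-slot" stands for a region where ads may appear.
POLICY = {
    "message-text": RegionPolicy(read=True, write=False),
    "mail-headers": RegionPolicy(),
    "address-book": RegionPolicy(),
    "ad-slot": RegionPolicy(read=True, write=True),
}


def ad_may(action: str, region: str) -> bool:
    """Default-deny check: unknown regions and actions are refused."""
    policy = POLICY.get(region, RegionPolicy())
    return {"read": policy.read, "write": policy.write}.get(action, False)
```

Under this encoding an ad can read the message text for targeting but cannot alter it, and it has no access at all to the address book or to any region the publisher did not list.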


Out-of-scope threats Many security threats posed by ads 
(and other third party content) have been identified by the 
security community. Recently, there has been intense re- 
search in this area which can complement our approach 
for protection against specific attacks. In particular, our 
work does not address the threats listed below. In this sec- 
tion we omit threats for which publishers can readily de- 
ploy strong protection (e.g., cross-site request forgeries). 


1. Browser security bugs. We do not address browser vul- 
nerabilities such as drive-by-downloads [49, 36, 5], at- 
tacks launched through plug-ins [24], vulnerabilities in 
image rendering [23] and so on. 
2. Opaque content. Our approach leverages web content 
introspection capabilities of JavaScript, and is there- 
fore most capable of enforcing fine-grained control 
where such transparency is available. Although our ap- 
proach provides coarse-grained confidentiality and in- 
tegrity protection from opaque content (e.g., Flash), the 
many possible attack vectors from these binary formats 
require special treatment [13]. 


3. Frame busting & navigation attacks. These are difficult
attacks for any dynamic policy enforcement mechanism to
prevent, due to the limited API exposed by browsers.
Protection measures against frame busting have been
explored in detail [39] and could be used to enhance our
approach.


4. Behavior tracking attacks. These are attacks that track 
a user across multiple sites and sessions through use of 
cookies. These could be addressed by users choosing 
restrictive cookie policies, though such policies may 
interfere with the functionality of some web sites. 


5. Attacks through side channels. Sites can track users 
through side channels, such as the cache timing chan- 
nel [11], the “visited links” feature of browsers [19] 
and so on. It is difficult to defend against these vectors
without browser customization, which is impractical for
publishers to deploy.


2.3 Related Work 


Privacy and behavioral targeting A few recent ap- 
proaches have looked at the problem of addressing secu- 
rity issues in online advertising. Privads [15] and Ad- 
nostic [47] address this problem primarily from a user 
privacy perspective. They both rely on specialized, in- 
browser systems that support contextual placement of ads 
while preventing behavioral profiling of users. In contrast, 
our work mainly focuses on a different, publisher-centric 
problem of protecting confidentiality and integrity of pub- 
lisher and user-owned content. Our work is also aimed 
at providing compatibility with existing ad networks and 
browsers. 


Restricting content languages There have been a num- 
ber of works [9, 6, 28, 29, 30, 12] in the area of Java- 
Script analysis that restrict content from ad networks to 
provide security protections. These works focus on limit- 
ing the JavaScript language features that untrusted scripts 
are allowed to use. The limitation is enforced statically 
by checking the untrusted script and ensuring it conforms 
to the language restrictions. Only those language features 
that are statically deterministic and amenable to analysis 
are allowed. Since much of the policy enforcement is 
done statically, these solutions typically have good run- 
time performance. In the cases of FBJS [9] and AD- 
safe [6], untrusted scripts are allowed to make calls to 
an access-controlled DOM (document object model) interface,
which incurs some overhead but affords additional
control. The cost in employing a restricted JavaScript sub- 
set is that ads authored by many advertisers may not con- 
form to this subset, and therefore require re-development 
of ad script code. In contrast, ADJAIL neither imposes the 
burden of new languages nor places restrictions on Java- 
Script language features used in ad scripts. The only effort 
required from a publisher that incorporates ADJAIL is to 
specify policies that reflect site security practices. 


Code transformation approaches Many recent ap- 
proaches [37, 53, 22, 14, 34, 10, 35] have been pursued to 
transform untrusted JavaScript code to interpose runtime 
policy enforcement checks. These works cover the many 
diverse aspects by which third-party content may subvert 
policy enforcement checks. Since these works are aimed 
at general JavaScript security, they are not specialized 
to the problem of securing ads for publishers, where the 
main issue is ensuring transparent interposition. This is to 
avoid any conflict with ad targeting and billing strategies 
employed by ad networks. The recommended method of 
transforming JavaScript dynamically by a publisher in- 
volves using a proxy (e.g., for handling scripts sourced 
from an external URI). However, routing all ad script 
HTTP requests through a script-transformation proxy may 
appear suspicious to click-fraud detection mechanisms [2] 
employed by the ad network. 


Publisher-browser collaboration An alternative ap- 
proach is for a publisher to instruct a browser to enforce 
the publisher’s policies on third-party content, leaving 
the enforcement entirely to the browser. This publisher- 
browser collaborative approach is a sound one in the 
long term to enforce a wide range of security policies 
as illustrated in BEEP [21], End-to-End Web Applica- 
tion Security [8], Content Security Policies [43] and Con- 
Script [33]. The main positives of this approach are that 
it can enforce fine-grained policies with minimal over- 
heads. The primary drawback is that today’s browsers 
do not agree on a standard for publisher-browser collab- 
oration, leaving a large void in near-term protection from 
malicious third-party content. 


3 Architecture 


Let us revisit our running example of a publisher who 
wishes to carry ads on a webmail application. Recall 
that the publisher embeds an ad network’s JavaScript code 
within the HTML of the webmail page to enable ads. In 
the benign case, this JavaScript code scans the webmail 
user’s email message body to find keywords for contex- 
tual ad targeting, then dynamically loads a relevant ad. For 
simplicity, we refer to the ad network’s JavaScript and an 
advertiser’s JavaScript (the latter loaded dynamically by 
the former) as the ad script. This section gives a high
level overview of how we prevent the ad script from
performing a variety of attacks against the publisher and
end user.

Our approach is to initially confine the ad script to a 
hidden isolated environment. The hidden environment is 
locally and logically isolated [27, 44] as opposed to re- 
quiring additional physical and remote resources [31]. We 
then detect effects of the ad script that would normally be 
observable by the end user, had the script not been con- 
fined by our approach. These effects are replicated, sub- 
ject to policy-based constraints, outside the isolated envi- 
ronment for the user to observe and interact with. User 
actions are then forwarded to the isolated environment to 
allow for a response by the ad script. Thus we facilitate 
a controlled cycle of interaction between the user and the 
advertisement, enabling dynamic ads while blocking sev- 
eral malicious behaviors. 


3.1 Ad confinement using shadow pages 


As a basic policy, the publisher wants to ensure ad script 
does not access the publisher’s private script data. If 
this policy is not enforced, ad script can read the sensitive
document.cookie variable and leak its contents,
enabling the recipient of the cookie to hijack the authen- 
ticated user’s webmail session. Furthermore, ad script 
should not be allowed to read confidential user data from 
the page (e.g., email message headers and address book 
entries). Such data is normally accessible via the brow- 
ser’s document object model (DOM) script interfaces. 

To enforce the publisher’s policy, we leverage browser 
enforcement of the same-origin policy (SOP) [50], an ac- 
cess control mechanism available in all major JavaScript- 
enabled browsers. Web browsers enforce the SOP to pre- 
vent mutually distrusting web sites from accessing each 
other’s JavaScript code and data. As a script instantiates 
code and data items, the browser places each item un- 
der the ownership of the script’s origin principal. Origin 
principals are identified by the domain, protocol and port 
number components of the script’s uniform resource iden- 
tifier (URI). Whenever a script references code or data, 
both the script and item being accessed must be owned by 
the same origin, else access is denied. 
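
The origin comparison described above can be illustrated with a short sketch. It only approximates what browsers do internally: the function name and the host names are made up for this example, and the sketch relies on the standard URL class available in browsers and in Node.js.

```javascript
// Hypothetical helper (not part of ADJAIL): returns true when two URIs
// share an origin, i.e., the same protocol, domain and port.
function sameOrigin(uriA, uriB) {
  const a = new URL(uriA);
  const b = new URL(uriB);
  // URL.port is "" when the port is the protocol default, so
  // "http://host/" and "http://host:80/" compare as equal.
  return a.protocol === b.protocol &&
         a.hostname === b.hostname &&
         a.port === b.port;
}

// Illustrative origins only; a real deployment picks its own domains.
const realPage = "https://mail.example.com/inbox";
const shadowPage = "https://ads-shadow.example.net/shadow.html";

console.log(sameOrigin(realPage, "https://mail.example.com/compose"));
console.log(sameOrigin(realPage, shadowPage));
```

Because the shadow page in this sketch is served from a different domain, a script running in it would be treated as a different origin principal than the real page.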

To enforce the publisher’s ad script policy, we begin by 
removing the ad script from the publisher’s webmail page. 
Next, we embed a hidden <iframe> element in the page. 
This <iframe> has a different origin URI, thus invoking 
the browser’s SOP and thereby imposing a code and data 
isolation barrier between the contents of the <iframe> 
and enclosing page. Finally, we add the ad script to the 
page contained in the hidden <iframe>. We refer to the 
hidden <iframe> page as the shadow page, and the enclosing
webmail page as the real page. The transformation just
described is depicted in Figure 2.

In the process of rendering the real page, the browser
renders the shadow page, executing the ad script within.
Our use of the SOP mechanism effectively relegates this
ad script to an isolated execution environment. All access
by ad script to code or data in the real page will be blocked
due to enforcement of the SOP. Furthermore, the ad script
cannot retrieve confidential address book data via DOM
interfaces, as access to those APIs is denied by the SOP. We
can say the publisher’s basic policy is enforced, because
(1) all such ad scripts are relocated to the shadow page,
and (2) the browser correctly enforces the SOP.


Figure 2: Relocating the ad script to a hidden shadow page
invokes the browser’s same-origin policy for confinement.


3.2 Controlled user interaction with ads 


Consider an ad script that loads a product image, or ban- 
ner. Normally the banner appears on the real page, but 
since the ad script runs in the shadow page, the banner 
is rendered on the shadow page instead. Without further 
steps, the webmail user viewing the real page will never 
see this banner because the shadow page is hidden. We 
now describe how the user is able to interact with the 
shadow page ad by content mirroring (§3.2.1) and event 
forwarding (§3.2.2), subject to policy-based constraints 
(§3.2.3). 


3.2.1 Ad mirroring 


A detailed view of the real and shadow pages that depicts 
mirroring of ad content is shown in Figure 3. We add 
Tunnel Script A to the shadow page that monitors page 
changes made by the ad script (①), and conveys those
changes (②) to the real page via inter-origin message
conduits [1, 20]. We add complementary Tunnel Script B 
to the real page that receives a list of shadow page changes 
and replicates their effects on the real page. Thus when ad 
script creates a banner image on the shadow page, Tunnel 
Script A sends a description of the banner to Tunnel Script 
B, which then creates the banner on the real page for the 
end user to see. 


Special care is taken to prevent sending redundant 
HTTP requests to the ad server during the mirroring pro- 
cess, as such requests can interfere with an ad network’s 
record keeping and billing operations. These details are
discussed in depth in §4.3.2.
[Figure 3 diagram: the Real Page (Message Body, Address Book,
Tunnel Script B, Ad Content) shown beside the Shadow Page
(Message Body, Ad Script), with regions marked Read-Only
Access and Read-Write Access.]


Figure 3: Overview of ADJAIL integrated with a webmail
application. Ad script is given read-only access to email
message body for contextual targeting purposes. Ad script can
write to designated area to right of message body.
Confidential data such as address book and mail headers are
inaccessible to ad script.


3.2.2 Event forwarding 


Ads sometimes respond in complex ways to user gener- 
ated events such as mouse movement and clicks. To fa- 
cilitate this interaction, we capture events on mirrored ad 
content and forward these events (Figure 3, ③) to the 
shadow page for processing. For example, if the ad script 
registers an onmousemove event handler with the original 
banner image, we register our own (trusted) event handler 
on the mirrored banner image. Our handler listens for the 
mouse-move event and forwards it to the shadow page’s 
banner via an inter-origin message. If the ad script re- 
sponds to the mouse-move event by altering the banner or 
producing new ad content, these effects are replicated on 
the real page by our mirroring strategy outlined above. 
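
The forwarding step can be sketched as follows. This is an illustration rather than the ADJAIL source: the function name, the message layout and the choice of forwarded event fields are assumptions, and the transport is passed in as a callback so the sketch runs outside a browser.

```javascript
// Hypothetical sketch of a trusted handler installed on a mirrored element.
// `syncId` identifies the matching element on the shadow page and `send`
// delivers a string to the shadow page (in a browser this would wrap
// shadowFrame.contentWindow.postMessage(message, shadowOrigin)).
function makeForwarder(syncId, send) {
  return function onEvent(event) {
    // Copy only plain, serializable fields; DOM nodes such as event.target
    // cannot cross the origin boundary and are left out on purpose.
    const message = {
      type: "DispatchEvent",
      syncId: syncId,
      event: {
        type: event.type,
        clientX: event.clientX,
        clientY: event.clientY,
      },
    };
    send(JSON.stringify(message));
  };
}

// Usage with a stand-in transport that records what would be sent.
const outbox = [];
const handler = makeForwarder(12, (message) => outbox.push(message));
handler({ type: "mousemove", clientX: 40, clientY: 8, target: {} });
console.log(outbox[0]);
```

In a browser the returned handler would be registered on the mirrored banner with addEventListener for each event type the ad script listens to.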


3.2.3 Ad policies 


All messages sent between the real and shadow pages are 
mediated by our policy enforcement mechanism. This 
mechanism enforces policy rules which are specified by 
the publisher as annotations in the real page HTML. For 
the webmail example in Figure 3, the following access 
control policies are specified (shown in bold): 

1  <div id="MessageBody"
2       policy="read-access: subtree;">
3    Message body text here... </div>
4  <div id="Advertisement"
5       policy="write-access: subtree;"></div>


The policy in line 2 allows the ad script read-only ac- 
cess to the email message body. Read-only access is en- 
forced by initially populating the shadow page with con- 
tent from the real page (ref. Message Body regions in 
Figure 3). If ad script makes changes to read-only content, 
those changes are not mirrored back to the real page. Any 
attempts to mirror those changes to the real page message
body (perhaps by a compromised Tunnel Script A) are denied.

The policy in line 5 permits the ad script write access to 
the sidebar on the right of the email message body. This 
is the region where the ad banner is to appear. When ad 
script creates content in the shadow page sidebar, this pol- 
icy allows our mirroring logic to reproduce that content 
on the real page sidebar. 

An implicit policy restriction on all mirrored content 
is that executable script code can not be written to the 
real page. To enforce this restriction, we only mirror 
items conforming to a configurable whitelist of static con- 
tent types. Note this script injection threat is distinct 
from cross-site scripting (XSS), which the site can defend 
against using well-researched approaches (e.g., [46]). 
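
A whitelist filter of this kind might look like the sketch below. It operates on the node model format of Figure 5; the tag and attribute sets are invented for illustration and are not the whitelist ADJAIL ships, and the attribute-value check is deliberately minimal.

```javascript
// Hypothetical whitelist; the sets below are assumptions for this example.
const ALLOWED_TAGS = new Set(["div", "span", "p", "a", "b", "i", "br"]);
const ALLOWED_ATTRS = new Set(["id", "class", "href", "title"]);

// Returns a sanitized copy of a model, or null if the node must be dropped.
function sanitizeModel(model) {
  if (model.nodeType === "TEXT_NODE") {
    return { nodeType: "TEXT_NODE", nodeValue: String(model.nodeValue) };
  }
  if (model.nodeType !== "ELEMENT_NODE") return null;

  const tagName = String(model.tagName).toLowerCase();
  if (!ALLOWED_TAGS.has(tagName)) return null; // e.g., <script>

  const attributes = {};
  for (const [name, value] of Object.entries(model.attributes || {})) {
    // Event handler attributes (onclick, ...) are not in the set and drop out.
    if (!ALLOWED_ATTRS.has(name.toLowerCase())) continue;
    // Not exhaustive: a real filter must vet attribute values more carefully.
    if (/^\s*javascript:/i.test(String(value))) continue;
    attributes[name] = String(value);
  }

  const children = (model.children || [])
    .map(sanitizeModel)
    .filter((child) => child !== null);

  return { nodeType: "ELEMENT_NODE", tagName, attributes, children };
}

const cleaned = sanitizeModel({
  nodeType: "ELEMENT_NODE",
  tagName: "div",
  attributes: { id: "ad", onclick: "steal()" },
  children: [
    { nodeType: "ELEMENT_NODE", tagName: "script", attributes: {}, children: [] },
    { nodeType: "TEXT_NODE", nodeValue: "Buy now" },
  ],
});
console.log(JSON.stringify(cleaned));
```

Building a fresh object from allowed parts, rather than deleting disallowed parts from the input, means anything the filter does not recognize is dropped by default.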

The full policy language (detailed in §4.1) supports
content restrictions to block Flash, deny the use of images
(for text-only ads), restrict the size of ads, and more.
These constraints can be tailored to the minimum
compatibility requirements of individual ad networks, which
we show in §5 can prevent attacks such as clickjacking [17].

Our policy enforcement mechanism is implemented on 
the real page as part of Tunnel Script B. As stated earlier, 
the ad script can not access the real page (including Tunnel 
Script B) due to SOP enforcement. Therefore ad script can 
not tamper with our policy enforcement mechanism. 


4 Implementation 


The implementation of ADJAIL is described in the context 
of a single webmail page with an embedded ad, which is 
integrated with our defense solution. We present the policy
language used to restrict ads in §4.1. Then in §4.2 we
describe how the real and shadow pages are constructed.
§4.3 explains how we facilitate interaction between the
two.


4.1 Policies 


By default, ad script is given no access to any part of the 
real page unless granted by policies (i.e., default-deny). 
An implicit policy we always enforce is that ad script can 
not inject script code onto the real page, nor execute script 
code with privileges of the real page. We now describe 
in detail the individual permissions granted by policies, 
how policies are specified, and how multiple policies are 
combined to form a composite policy. 


Permissions ADJAIL supports a basic set of permissions 
that control how ads appear on the real page and how ads 
can behave, summarized in Table 1. We define a policy as 
an assignment of values to each of the permissions. Our 
permissions have been designed iteratively by studying
requirements of ads from several ad networks, and our
results presented in §5 show the supported permissions can
be composed to form useful advertisement policies.

Permission     Values                   Description / Effects

read-access    none†*, subtree          Controls read access to element’s
                                        attributes and children.
write-access   none†*, append,          Controls write access to element’s
               subtree                  attributes and children. Append is
                                        not inherited.
enable-images  deny†*, allow            Enables support in the whitelist for
                                        <img> elements, CSS background-image
                                        and CSS list-style-image properties.
enable-iframe  deny†*, allow            Enables <iframe> elements in whitelist.
enable-flash   deny†*, allow            Enables <object> elements of type
                                        application/x-shockwave-flash in
                                        whitelist.
max-height,    0*, n%, n cm, n em,      Sets maximum height / width of element
max-width      n ex, n in, n mm, n pc,  to n units. Smaller dimensions are
               n pt, n px, none†        more restrictive. When composing values
                                        specified in incompatible units, most
                                        ancestral value wins.
overflow       deny†*, allow            Content can overflow boundary of
                                        containing element if allowed.
link-target    blank*, top, any†        Force targets of <a> elements to _blank
                                        or _top. Not forced if set to any.

Table 1: Permissions that can be set in policy statements. *Most restrictive value. †Default value.


The permissions read-access and write-access
control what parts of the page ad script may read from
or write to. Of particular interest is the append setting
for write-access. This level of access allows ad script to
add child content to an element, but neither read nor modify
existing children of the element. Any appended children are
automatically given a policy attribute set to
write-access: subtree;. Some ads, such as the clock ad (#4)
in Figure 1, require the append permission to add floating
(i.e., absolutely positioned) content to the <body> element.
In supporting these ads, we don’t want to grant subtree
write access to the <body> element, as that would enable a
malicious ad to overwrite the entire page. Granting append
access in this case is safer as it adheres to the principle
of least privilege [40].


Part of our policy enforcement is a whitelist of HTML
elements, attributes and CSS properties that ad script is
allowed to write to the real page. Although this whitelist
can be modified by the publisher at a low level, we support
the following higher-order controls for tuning the
whitelist. Ads are text-only by default; to enable images,
the enable-images permission can be set to allow, thus
expressing a publisher’s content restrictions policy on the
use of third-party images. Another content restrictions
permission is enable-flash, which allows Flash ads to be
displayed. Since our framework doesn’t address security
threats from opaque content such as Flash (§2.1), a
publisher must exercise severe caution in enabling this
permission. Also, <iframe> elements can be allowed via
enable-iframe. However, allowing <iframe> elements can
facilitate attacks such as clickjacking [17] and drive-by
downloads [36].

The max-height, max-width and overflow permissions control
how the ad appears on the page. If an element’s size
surpasses the max-width or max-height dimension and the
overflow permission is set to deny, then excess content is
hidden. Otherwise the excess content will overlap other
parts of the page. The overflow permission is useful because
some ads consume a small area when not in use, but may
overlap non-ad content when engaged by the user (e.g.,
expanding menus). Publishers may wish to disallow expanding
ads because they can overlap trusted page content.


The link-target permission controls the HTML 
target attribute of all <a> elements (and <form> el- 
ements, if allowed by whitelist) in mirrored content. By 
setting this permission, the publisher can specify that ac- 
tivated links or submitted forms must be directed to a new 
browser tab / window (if set to blank), or directed to 
the tab / window hosting the real page (if set to top). 
Whether links open in the same or new window is of- 
ten agreed to between the publisher and ad network. The 
link-target permission can be used to protect the pub- 
lisher from ad script that mistakenly creates content that 
does not adhere to the agreed upon link behavior. 
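
As a rough illustration of how link-target could be applied to mirrored content, the sketch below rewrites the target attribute in a node model (the format of Figure 5). The function and constant names are assumptions made for this example.

```javascript
// Maps the permission values of Table 1 to HTML target values.
// "any" is absent on purpose: it leaves targets untouched.
const FORCED_TARGETS = new Map([
  ["blank", "_blank"],
  ["top", "_top"],
]);

// Hypothetical helper: forces the target of every <a> and <form> element
// in a model subtree, modifying the model in place.
function applyLinkTarget(model, linkTarget) {
  if (model.nodeType !== "ELEMENT_NODE") return model;
  const forced = FORCED_TARGETS.get(linkTarget);
  if (forced !== undefined &&
      (model.tagName === "a" || model.tagName === "form")) {
    model.attributes = { ...model.attributes, target: forced };
  }
  for (const child of model.children || []) {
    applyLinkTarget(child, linkTarget);
  }
  return model;
}

const ad = {
  nodeType: "ELEMENT_NODE",
  tagName: "div",
  attributes: {},
  children: [{
    nodeType: "ELEMENT_NODE",
    tagName: "a",
    attributes: { href: "http://advertiser.example/", target: "_self" },
    children: [],
  }],
};
console.log(applyLinkTarget(ad, "blank").children[0].attributes.target);
```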


Policy specification The publisher can annotate any 
HTML element of the real page with a policy at- 
tribute. The policy attribute contains a set of state- 
ments, each terminated by a semicolon. Each state- 
ment specifies the value of a particular permission in 
the form, permission: value;. Acceptable values for 
permission and value are listed in Table 1. 
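
The statement syntax can be parsed in a few lines. The sketch below is one possible parser, not the one used by ADJAIL; the function name and the decision to skip malformed statements silently are assumptions.

```javascript
// Hypothetical parser for the contents of a policy attribute, e.g.
// "read-access: subtree; enable-images: allow;".
// Returns an array of { permission, value } objects in document order.
function parsePolicy(attributeValue) {
  const statements = [];
  for (const raw of String(attributeValue || "").split(";")) {
    const text = raw.trim();
    if (text === "") continue;           // trailing semicolon or blank entry
    const colon = text.indexOf(":");
    if (colon < 0) continue;             // malformed statement: skipped here
    statements.push({
      permission: text.slice(0, colon).trim(),
      value: text.slice(colon + 1).trim(),
    });
  }
  return statements;
}

console.log(parsePolicy("write-access: subtree; enable-images: allow;"));
```

The parser does not check names or values against Table 1; that validation is left to whatever consumes the statements.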


Permissions granted in an element’s policy attribute 
are inherited by descendants in the HTML document hi- 
erarchy. That is, the scope of a permission P is the 
HTML subtree rooted at the element whose policy at- 
tribute grants P. 
Algorithm 1: ComputePolicy( targetElement )

  policy ← new Object();
  WABeforeAppend ← undefined;
  foreach element from root to targetElement do
      if policy[ “write-access” ] = “append” then
          policy[ “write-access” ] ← WABeforeAppend
      statements ← Parse( element.getAttribute( “policy” ) );
      foreach stmt in statements do
          policy ← ComposePolicies( policy, stmt );
      if policy[ “write-access” ] ≠ “append” then
          WABeforeAppend ← policy[ “write-access” ];
  foreach permission in all permissions do
      if permission is not defined in policy then
          policy[ permission ] ← GetDefaultValue( permission );
  return policy;


Policy composition Multiple policy statements may as- 
sign different values to a single permission. This can oc- 
cur within a single policy attribute or through inheri- 
tance. We resolve the ambiguity of multiple permission 
values through a composition process. The composition 
algorithm, given in Algorithm 1, takes a target element as 
input and derives an assignment of values to each of the 
permissions listed in Table 1. 


We can describe the composition algorithm intuitively 
as follows. The effective value for a permission is the 
most restrictive value specified for that permission across 
all composed policy statements. That is, if a permission 
appears in multiple statements (either within an element’s 
policy attribute or in separate inherited policies), we 
take the intersection of all specified values for the per- 
mission. After all statements have been composed, any 
permissions left unspecified are set to their most restric- 
tive values. 


To enhance usability we introduced three minor exceptions
to the above. First, the max-height and max-width
permissions default to their least restrictive value (i.e.,
none). We chose this default because a definitive maximum
height and width will not be satisfactory for every type of
ad. It is better for each publisher to explicitly declare
these values if such restrictions are desired. The policy
semantics is still default-deny, because write permissions
must first be granted before restrictions on the size of
written content can have any impact. For the same reasons,
our second exception defaults link-target permission to its
least restrictive value. The third exception is we prevent
inheritance of append write permissions. This is important
as append specifically does not grant access to existing
children of an element; thus any existing children should
not inherit the append permission.


(a)  <script type="text/javascript">
       google_ad_client = "pub-...";
       google_ad_width = 728;
       google_ad_height = 90;
       google_ad_format = "728x90_as";
       google_ad_type = "text";
     </script>
     <script type="text/javascript"
       src="http://pagead2.googlesyndication.com/pagead/show_ads.js"
     ></script>

(b)  <script type="text/javascript"
       src="AdJail.js"></script>


Figure 4: (a) Google AdSense ad script, removed from real
page. (b) Tunnel Script B, added to real page.
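
Algorithm 1 and the composition rules of §4.1 can be sketched in JavaScript as shown below. Several simplifications are assumptions of this sketch rather than statements about ADJAIL: the element path is given as a list of policy attribute strings from root to target, restrictiveness is modeled by the value order of Table 1 (treating append as more restrictive than subtree), unknown permissions or values are ignored, and max-height / max-width are left out because composing mixed units needs extra rules.

```javascript
// Values ordered from most to least restrictive, following Table 1.
const ORDER = {
  "read-access": ["none", "subtree"],
  "write-access": ["none", "append", "subtree"],
  "enable-images": ["deny", "allow"],
  "enable-iframe": ["deny", "allow"],
  "enable-flash": ["deny", "allow"],
  "overflow": ["deny", "allow"],
  "link-target": ["blank", "top", "any"],
};

// Defaults from Table 1; link-target defaults to its least restrictive value.
const DEFAULTS = {
  "read-access": "none", "write-access": "none",
  "enable-images": "deny", "enable-iframe": "deny", "enable-flash": "deny",
  "overflow": "deny", "link-target": "any",
};

function parse(attr) {
  return String(attr || "").split(";")
    .map((s) => s.split(":").map((part) => part.trim()))
    .filter((parts) => parts.length === 2 && parts[0] !== "")
    .map(([permission, value]) => ({ permission, value }));
}

// Keep the more restrictive of the current and the newly stated value.
function composePolicies(policy, stmt) {
  const order = ORDER[stmt.permission];
  if (!order || !order.includes(stmt.value)) return policy; // ignored here
  const current = policy[stmt.permission];
  if (current === undefined ||
      order.indexOf(stmt.value) < order.indexOf(current)) {
    policy[stmt.permission] = stmt.value;
  }
  return policy;
}

// `path` lists the policy attribute of each element from root to target.
function computePolicy(path) {
  let policy = {};
  let waBeforeAppend;
  for (const attr of path) {
    // append is not inherited: restore the value in effect before it.
    if (policy["write-access"] === "append") {
      policy["write-access"] = waBeforeAppend;
    }
    for (const stmt of parse(attr)) policy = composePolicies(policy, stmt);
    if (policy["write-access"] !== "append") {
      waBeforeAppend = policy["write-access"];
    }
  }
  for (const permission of Object.keys(DEFAULTS)) {
    if (policy[permission] === undefined) {
      policy[permission] = DEFAULTS[permission];
    }
  }
  return policy;
}

// <body policy="write-access: append;"> with an unannotated child element.
console.log(computePolicy(["write-access: append;"])["write-access"]);
console.log(computePolicy(["write-access: append;", ""])["write-access"]);
```

In the two calls at the end, the annotated element itself receives append, while its unannotated child falls back to the default because append is not inherited.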


4.2 Real and Shadow pages 


The architecture of our implementation requires changes 
to the original web page (real page) and creation of a
corresponding shadow page as described in §3.1. The shadow
page is hosted on a web server having an origin different 
from the real page, thus the browser’s same-origin pol- 
icy ensures the shadow page by default has no access to 
the cookies, content or other data belonging to the real 
page. Deploying our implementation requires a publisher 
to configure their DNS and web server to support the 
shadow page origin domain. Care must be exercised in 
the selection of the shadow page domain (one for each ad- 
vertiser) in order to ensure that there is no reuse or overlap 
of domains. 

To facilitate voluntary communication between the two 
pages, we leverage the window.postMessage() browser API.
postMessage() is an inter-origin frame communication
mechanism that enables two collaborating
frames to share data in a controlled way, even when SOP 
is in effect [1]. 


Construction of the real page The real page is a ver- 
sion of the publisher’s original page modified in three 
ways. The first modification is to remove the ad script 
(Figure 4a). Second, we add the tunnel script (Figure 4b) 
to the end of the page. The third modification to the orig- 
inal page is annotation of HTML elements with policies, 
which we discussed at length in §4.1. Only two annotations,
illustrated in §3.2.3, are required for the webmail
example.

The real page tunnel script has an initialization routine
that first scans the real page to find all elements with
policies granting the following permissions:
read-access: subtree;, write-access: append;, and
write-access: subtree;. All matching elements are converted
into models (i.e., JavaScript data structures) that will be
sent to the shadow page in a later stage. Script nodes are
omitted from models because we can not guarantee their
semantics are preserved on the shadow page. An example model
is shown in Figure 5, which models the readable Message Body
<div> element in the webmail page (corresponding HTML given
in §3.2.3).


{ nodeType: "ELEMENT_NODE",
  tagName: "div", syncId: 0,
  top: y, left: x, width: w, height: h,
  attributes: {
    id: "MessageBody",
    policy: "read-access: subtree;"
  },
  children: [
    {
      nodeType: "TEXT_NODE",
      nodeValue: "Message body text here..."
    }
  ],
  computedStyle: { ... }
}


Figure 5: Model of MessageBody element (as defined in
§3.2.3) sent from real page to shadow page

Of the elements found in the initial scan, those with 
read permission are modeled by encoding (non-script) el- 
ement attributes and readable child nodes into the model. 
The remaining elements (i.e., those having write access 
but no read access) are modeled as empty containers. That 
is, any attributes and child nodes are omitted from the 
model. 
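
The modeling step might be sketched as follows. To keep the example runnable outside a browser, input nodes are plain objects standing in for DOM nodes, which is an assumption of this sketch, as are the function name, the treatment of on* attributes as script attributes, and the simplification that a readable element makes its whole subtree readable.

```javascript
// Elements: { tagName, syncId, attributes, children }.
// Text nodes: { text }.
function buildModel(node, canRead) {
  const model = {
    nodeType: "ELEMENT_NODE",
    tagName: node.tagName,
    syncId: node.syncId,
    attributes: {},
    children: [],
  };
  // Write access without read access: model an empty container.
  if (!canRead) return model;

  for (const [name, value] of Object.entries(node.attributes || {})) {
    // Assumption: event handler attributes count as script and are omitted.
    if (!name.toLowerCase().startsWith("on")) model.attributes[name] = value;
  }
  for (const child of node.children || []) {
    if (typeof child.text === "string") {
      model.children.push({ nodeType: "TEXT_NODE", nodeValue: child.text });
    } else if (String(child.tagName).toLowerCase() !== "script") {
      model.children.push(buildModel(child, true)); // script nodes omitted
    }
  }
  return model;
}

const messageBody = {
  tagName: "div",
  syncId: 0,
  attributes: { id: "MessageBody", policy: "read-access: subtree;" },
  children: [{ text: "Message body text here..." }],
};
// The model is sent to the shadow page as a JSON string.
console.log(JSON.stringify(buildModel(messageBody, true)));
```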

All elements with a policy annotation and their descendant
elements are assigned a unique syncId attribute during
initialization. The sync ID is used to match elements
on the real page with their corresponding elements on the 
shadow page as content is kept synchronized between the 
two pages. As the final step of initialization, the tunnel 
script creates and embeds the hidden <iframe> element 
for the shadow page. 


Construction of the shadow page The shadow page be- 
gins as a template web page containing only the tunnel 
script. As the template page is rendered, the shadow 
page tunnel script receives content models (described 
above) from the real page’s tunnel script. The model 
data is sent as a character string in JSON [7] syntax via 
postMessage(). Once received by the shadow page, 
models are converted into HTML constructs using the 
browser’s DOM interfaces. This results in a web page 
environment containing all the non-sensitive content and 
constructs of the real page, in which we will allow the ad 
script to execute. 

To support ads (such as inline text ads) that appear or
behave differently depending on where content is positioned,
the shadow page is virtually sized to the dimensions of the
real page, and content models are rendered in the same
absolute position and size of their real page counterpart.
Position and size information is depicted in Figure 5 as
top, left, width and height properties. Throughout dynamic
updates these attributes are kept synchronized by an
approach given in §4.3.4.

Next, we install wrappers around several DOM API 
methods to interpose between the ad script and the 
DOM. Although ad script can circumvent our wrappers 
in Mozilla browsers by using the JavaScript delete op- 
erator [35], we do not rely on wrappers to enforce policies 
or security properties. Wrappers are used to monitor page 
updates and provide transparency with regard to the number
of impressions generated by ads, which we discuss at
length in §4.3.


Default ad zone Lastly, the ad script is embedded in the 
shadow page inside a container <div> element, which we 
automatically map to a corresponding <div> on the real 
page. We refer to these linked elements as the default ad 
zone. Automatic mapping is required because many ad 
scripts, such as Google AdSense, will not independently 
find and inject ads into the content imported from the real 
page. Rather they simply write ad content into the element 
containing the ad script. To support these ad scripts, the 
publisher indicates the default ad zone element on the real 
page by setting its HTML class attribute to include the 
class AdJailDefaultZone and ensuring the element’s 
policy grants subtree write access. If the real page has 
no valid and unique default zone, content written to the 
shadow page default zone will not appear on the real page. 


4.3 Synchronization 


After initial rendering of the real and shadow pages in the 
browser, the two pages are kept synchronized by exchang- 
ing the messages listed in Table 2. We conserve the total 
number of generated ad impressions, using an approach 
given in §4.3.1. Content written by ad script to the shadow 
page is mirrored to the real page by a process described 
in §4.3.2. User interface events are forwarded from the 
real page to the shadow page as detailed in §4.3.3. Lastly,
§4.3.4 describes how content position and style are kept
synchronized on both pages as needed by some ad scripts.


4.3.1 DOM interposition 


A primary goal of our approach is to conserve the number
of ad impressions detected by an ad server, which we
achieve using DOM interposition. Ad networks bill
advertisers, and in turn pay publishers, based in part on
the number of ad impressions. Impression counts are
correlated to the number of requests for ad resources
submitted to the web server [18]. When ad content is
rendered on the real page, any external resources not
available in the browser’s cache will be requested, causing
an impression. This may occur for several possible reasons
out of our control, such as: the user disabled the cache,
the ad network instructed the browser (via Cache-Control
HTTP headers) not to cache a resource, or per-origin cache
partitioning [19] is in effect.


(a) Real page to shadow page:

DispatchEvent( event )
    Dispatch event to shadow page.
SetScrollPos( x, y )
    Scroll hidden shadow page to coordinates ( x, y ).
SetStyle( syncId, properties )
    Set style of shadow page element identified by syncId as
    specified in properties.

(b) Shadow page to real page:

Initialize( step )
    Initialize communication channel (two steps).
InsertNode( syncId, index, model )
    Insert node described by model as child index of element
    identified by syncId.
ModifyAttribute( syncId, name, value )
    Set attribute name to value on element identified by syncId.
ModifyStyle( syncId, name, value, priority )
    Set CSS property name to value and priority on element
    identified by syncId.
ModifyText( syncId, index, data )
    Set text content to data on index child of element
    identified by syncId.
RemoveNode( syncId, index )
    Remove index child node of element identified by syncId.
ReplaceChildren( syncId, models )
    Replace child nodes of element identified by syncId with
    children described in models.
WatchEvent( syncId, type, phase )
    Register a listener for event type and phase (bubble /
    capture) on element identified by syncId.

Table 2: Synchronization messages sent between real and
shadow pages via DOM postMessage() API.

Impressions will be generated when the ad is rendered 
on the real page. Therefore, when ad content is initially 
rendered on the shadow page, we must prevent the brow- 
ser from submitting HTTP requests for external resources, 
as that would cause superfluous impressions. Our
implementation supports conserving impression counts for the
following elements in our whitelist: <img>, <iframe>
and <object> (Flash). Additionally, we conserve
impression counts for background image CSS properties
in our whitelist: background, background-image,
list-style and list-style-image.

To prevent ad impressions on the shadow page, we
interpose on the common interfaces ad scripts use to create
content. First, we hook DOM object prototype interfaces [25]
to prevent ad scripts from setting URI attributes. For
instance, we interpose on the src property of
HTMLImageElement objects, and the getAttribute() and
setAttribute() DOM methods. We also hook other interfaces
that access URI attributes, such as document.write(),
document.writeln(), and element.innerHTML, to increase
completeness and transparency of the interposition.
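
As a rough illustration of hooking a prototype property, the sketch below
replaces the src accessor on a stand-in class. It is not the ADJAIL
implementation: FakeImageElement, PLACEHOLDER_URI and intendedUris are
assumptions made so that the example runs outside a browser, where the
native HTMLImageElement prototype would be hooked instead.

```javascript
// Stand-in for HTMLImageElement (assumption) so the sketch runs without a DOM.
class FakeImageElement {
  constructor() { this._src = ""; }
  get src() { return this._src; }
  set src(value) { this._src = value; } // a browser would request the image here
}

const PLACEHOLDER_URI = "about:blank"; // hypothetical placeholder value
const intendedUris = new WeakMap();     // element -> URI the ad script asked for

function interposeUriProperty(proto, name) {
  const original = Object.getOwnPropertyDescriptor(proto, name);
  Object.defineProperty(proto, name, {
    configurable: true,
    enumerable: original.enumerable,
    get() {
      // Report the intended value so the hook stays transparent to the script.
      return intendedUris.has(this)
        ? intendedUris.get(this)
        : original.get.call(this);
    },
    set(value) {
      intendedUris.set(this, String(value));    // remember the real URI
      original.set.call(this, PLACEHOLDER_URI); // element only sees the placeholder
    },
  });
}

interposeUriProperty(FakeImageElement.prototype, "src");

const img = new FakeImageElement();
img.src = "http://ads.example/banner.png";
console.log(img.src);  // value the ad script observes
console.log(img._src); // value actually stored on the element
```

The getter returns the intended URI so that an ad script reading back the
property cannot easily tell that interposition took place, while the
recorded value remains available for later mirroring.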

When an ad script writes a URI attribute using one of these
APIs, we substitute the real URI value with a placeholder
value. For write(), writeln(), and innerHTML, this
substitution requires a character search and replace in
HTML source code. Our current implementation of this
operation uses regular-expression-based textual
transformation, which works well in practice but may be
imprecise in some circumstances. As the purpose
of this substitution is to conserve ad impressions, a loss
in precision here may affect compatibility with ads, but
not security. If more precision is required, work on
in-browser source-to-source HTML transformation [14, 34]
can be leveraged, at the cost of additional overhead.
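
The textual substitution might look roughly like the following sketch. The
pattern, the placeholder value and the function name substituteUris are
assumptions for illustration, not the expressions used by ADJAIL. Only
quoted attribute values are handled, which reflects the kind of imprecision
noted above.

```javascript
const PLACEHOLDER_URI = "about:blank"; // hypothetical placeholder value

// Matches a quoted src/data attribute inside an <img>, <iframe> or <object>
// tag. <script> is deliberately absent, since script URIs are left untouched.
const URI_ATTRIBUTE =
  /(<(?:img|iframe|object)\b[^>]*?\s(?:src|data)\s*=\s*)(["'])(.*?)\2/gi;

// Replace each matched URI with the placeholder and record the intended URI.
function substituteUris(html, intendedUris) {
  return html.replace(URI_ATTRIBUTE, (match, prefix, quote, uri) => {
    intendedUris.push(uri);
    return prefix + quote + PLACEHOLDER_URI + quote;
  });
}

const intended = [];
const rewritten = substituteUris(
  '<p><img alt="x" src="http://ads.example/a.png">' +
    '<script src="http://ads.example/ad.js"></script></p>',
  intended
);
console.log(rewritten);
console.log(intended);
```

Unquoted attribute values or markup assembled across several write() calls
would slip past this pattern, which affects compatibility rather than
security, as discussed above.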

One exception we make to the above scheme is for 
<script> elements. Our interposition does not block the 
setting of src attributes for scripts, because our goal is to 
enable ad scripts to execute in the shadow page. Thus 
scripts are the only source of ad impressions from the 
shadow page. Since our policy enforcement mechanism 
prevents ad scripts in the real page, each script is created 
only once, thereby conserving the number of ad impres- 
sions. 


4.3.2 Content mirroring 


We mirror ad content from the shadow page to the real 
page using a 5-step process: (1) monitoring the shadow 
page for modifications by the ad script, (2) modeling the 
detected modifications, (3) sending the model to the real 
page, (4) enforcing policies on the model, and (5) modi- 
fying the real page to reflect the model. 


1. Monitoring the shadow page for modifications
We monitor the shadow page for dynamic modifications
using DOM interposition logic (introduced in §4.3.1).
In addition to APIs that affect element attributes, we
also hook APIs that modify the document, such as
element.appendChild(). Whenever an ad script attaches
a new DOM node using appendChild(), our monitoring
code is invoked before the actual modification takes place.
Alternatively, DOM mutation events [51] can be leveraged 
to perform the same monitoring function with lower com- 
plexity than DOM interposition. However, Internet Ex- 
plorer does not yet support mutation events, which would 
result in decreased compatibility. 
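
A minimal sketch of method interposition, under the assumption that
wrapping the prototype method is sufficient: the monitor runs first, and
the original method is then allowed to proceed. FakeNode, hookMethod and
the observed log are hypothetical names; in a browser the hook would be
installed on the native element prototype.

```javascript
// Minimal stand-in for a DOM node (assumption) so the sketch runs without a DOM.
class FakeNode {
  constructor(tag) { this.tag = tag; this.children = []; }
  appendChild(child) { this.children.push(child); return child; }
}

const observed = []; // modifications seen by the monitor, in order

function hookMethod(proto, name, monitor) {
  const original = proto[name];
  proto[name] = function (...args) {
    monitor(this, name, args);         // runs before the real modification
    return original.apply(this, args); // then let the modification proceed
  };
}

hookMethod(FakeNode.prototype, "appendChild", (target, method, args) => {
  observed.push({
    parent: target.tag,
    method,
    child: args[0].tag,
    childCountBefore: target.children.length,
  });
});

const container = new FakeNode("div");
container.appendChild(new FakeNode("img"));
console.log(observed);
```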


2. Modeling the detected modifications When
modifications to the shadow page are detected, we encode those
changes using the same model format described in §4.2
and depicted in Figure 5. However, when we find content
that was substituted by our interposition (ref. §4.3.1), we
model the ad script’s intended content instead of the
substituted content. Models are passed to the real page, where
the modifications will be reflected to the extent allowed by
policies.


Figure 6: Rendering an ad image only on the real page so that
just one impression is generated.


3. Sending models to the real page The process of send- 
ing a model of an image element is depicted in Figure 6. 
In the shadow page, we serialize the model data structure 
to a JSON string. We send the serialized model from the 
shadow page to the real page using the InsertNode()
message from Table 2b. (Other types of modifications use
the additional postMessage() notifications listed in
Table 2b.) On the receiving end (i.e., the real page), we
deserialize the string to recover the model data structure.
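
The serialization round trip can be sketched as follows. The message
envelope (a type field alongside syncId, index and model) and the shape of
the model object are assumptions; the actual model format is the one
described in §4.2. The postMessage() transport is replaced by passing the
string directly so that the sketch runs outside a browser.

```javascript
// Shadow page side: wrap a content model in an InsertNode message.
function encodeInsertNode(syncId, index, model) {
  return JSON.stringify({ type: "InsertNode", syncId, index, model });
}

// Real page side: recover the message and reject anything malformed.
function decodeMessage(data) {
  const msg = JSON.parse(data);
  if (typeof msg !== "object" || msg === null || typeof msg.type !== "string") {
    throw new Error("malformed synchronization message");
  }
  return msg;
}

// Hypothetical model of an image element.
const model = {
  tag: "img",
  attributes: { src: "http://ads.example/banner.png", width: "728" },
  children: [],
};

const wire = encodeInsertNode("sync-17", 0, model);
// In a browser the string would be handed to postMessage() at this point.
const received = decodeMessage(wire);
console.log(received.type, received.model.tag);
```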


4. Enforcing policies on the models Our policy enforce- 
ment code in the real page receives the model from the 
shadow page. The model is then checked for any content 
that violates the real page policy annotations. We trim 
all policy-violating content from the model. For instance, 
if the model describes an image to be added to the page 
where the enable-images permission is denied, then we 
remove the image from the model. If the model describes 
an ad that is 1000 pixels wide and the policy only allows 
the ad to be 600 px, we allow the ad but restrict its maxi- 
mum width to 600 px. 
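
The two examples above (removing a disallowed image and capping an
oversized ad) might be sketched as below. The policy fields enableImages
and maxWidthPx and the model shape are hypothetical stand-ins for the
policy annotations and model format described earlier.

```javascript
// Recursively drop elements the policy does not permit; returns a new model.
function trimModel(model, policy) {
  if (model.tag === "img" && !policy.enableImages) return null;
  const children = [];
  for (const child of model.children || []) {
    const kept = trimModel(child, policy);
    if (kept !== null) children.push(kept);
  }
  return { tag: model.tag, style: { ...(model.style || {}) }, children };
}

// Trim the model, then cap the width of the top-level ad element.
function enforcePolicy(model, policy) {
  const trimmed = trimModel(model, policy);
  if (trimmed !== null && typeof policy.maxWidthPx === "number") {
    trimmed.style["max-width"] = policy.maxWidthPx + "px";
  }
  return trimmed;
}

const adModel = {
  tag: "div",
  style: { width: "1000px" },
  children: [
    { tag: "img", style: {}, children: [] },
    { tag: "span", style: {}, children: [] },
  ],
};

const enforced = enforcePolicy(adModel, { enableImages: false, maxWidthPx: 600 });
console.log(JSON.stringify(enforced));
```

The ad itself is kept and only constrained, which mirrors the choice
described above of allowing the ad while restricting its maximum width.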


5. Modifying the real page to reflect the modeled
changes Finally we merge the changes represented by the
model into the real page. We create or modify constructs
using DOM APIs, such as document.createElement() and
element.setAttribute(). To ensure scripts are not injected
into the real page during this process, we leverage
the techniques we developed in BLUEPRINT [46] to
enforce a no-script policy over all merged changes.
This entails protecting several script injection vectors,
including <script> elements, event handler attributes,
javascript: URI schemes, CSS expressions, and
more.

Mirroring ad content on the real page has the side-effect 
of modifying the real page script execution environment. 
For instance, elements such as <input name="query" 
...> can pollute the namespace by creating properties 
such as document.elements.query. A straightfor- 
ward solution to this problem is disallowing name and id 
attributes on mirrored ad content; however, this may re- 
duce compatibility with some ads. 


4.3.3 Event forwarding 


To prevent code injection attacks during content mirror- 
ing, our whitelist intentionally omits event handlers such 
as onclick and onmouseover that have been attached to 
ad content. In order to preserve event handler functional- 
ity in spite of this restriction, we perform event forward- 
ing. 

Event forwarding leverages our DOM interposition
framework. We interpose on script operations used to
register event handlers such as handler attributes and
object properties (e.g., onclick, onload, etc.), using
the same mechanism used for URI attributes and
properties described in §4.3.1. Additionally,
browser-specific APIs such as element.addEventListener()
and element.attachEvent() are detected and interposed
on when present.

When an ad script uses any of these APIs to register an
event handler on an element, and that element is also
mirrored on the real page, we register our own handler for
the same event on the mirrored element. Event handlers
are registered on the real page when specified in content
models (InsertNode() and ReplaceChildren()
messages), or by sending the WatchEvent() message of
Table 2. Whenever the event occurs on the real page, our
handler is invoked and sends details of the event to the
shadow page using the DispatchEvent() message
(indicated by path 6 in Figure 3). On the shadow page we
establish the appropriate JavaScript scope, then dispatch
the event to the target element. This in turn invokes the
ad script’s original event handler. Effects caused by the ad
script’s handler are detected and mirrored back to the real
page using the mechanism described in §4.3.2.
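
The forwarding path can be sketched with an in-memory stand-in for the
postMessage() channel. The handler registry keyed by syncId and event type,
the event fields that are copied, and all function names are assumptions
made for illustration.

```javascript
// Shadow page state: ad script handlers keyed by syncId and event type.
const shadowHandlers = new Map();
const log = [];

function shadowRegister(syncId, type, handler) {
  shadowHandlers.set(`${syncId}:${type}`, handler);
}

// Shadow page side: receive DispatchEvent and invoke the original handler.
function shadowReceive(wire) {
  const msg = JSON.parse(wire);
  if (msg.type !== "DispatchEvent") return;
  const handler = shadowHandlers.get(`${msg.event.syncId}:${msg.event.type}`);
  if (handler) handler(msg.event);
}

// Real page side: the handler attached to the mirrored element only forwards
// event details; it never runs ad script code itself.
function makeForwarder(syncId, type, send) {
  return (nativeEvent) =>
    send(JSON.stringify({
      type: "DispatchEvent",
      event: {
        syncId,
        type,
        clientX: nativeEvent.clientX,
        clientY: nativeEvent.clientY,
      },
    }));
}

// The ad script registers a handler on the shadow page ...
shadowRegister("sync-3", "mouseover", (e) =>
  log.push(`ad handler saw ${e.type} at ${e.clientX},${e.clientY}`)
);
// ... and the user triggers the event on the real page.
const forward = makeForwarder("sync-3", "mouseover", shadowReceive);
forward({ clientX: 10, clientY: 20 });
console.log(log);
```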


Ad clicks Unlike other user interface events, we do not 
forward click events on <a> (link) elements. Instead we 
click (i.e., activate) links on the real page, subject to en- 
forcement of the link-target permission. This has 
the effect of bypassing any click event handlers the ad 
script may have registered on the activated link.
Therefore there can be a compatibility trade-off in enforcing the
link-target permission if the ad script depends on such
event handlers.


4.3.4 Position and style synchronization 


Some ads mimic the appearance of a pop-up window by 
temporarily overlaying parts of the web page. Although 
the pop-up window can appear at variable locations on the 
page, typically it is positioned such that it is visible (given 
the portion of the page that is scrolled into view) and rela- 
tive to some other content (such as a contextual keyword). 
The ad script contains logic to compute the pop-up loca- 
tion based on the above criteria. However, if content ap- 
pears at a different location on the real page than it does on 
the shadow page, the pop-up will be positioned incorrectly 
when mirrored. For this reason we support synchroniz- 
ing the visual aspect of both real and shadow pages, even 
though the shadow page remains hidden. 


First, we keep the window sizes of each page synchro- 
nized by setting the shadow page size to 100% of the real 
page size. Second, we sync the scroll position of both 
pages by registering an event handler for the real page’s 
onscroll event. Whenever the event fires, we send a 
SetScrollPos() message to the shadow page. Our code 
running in the shadow page receives this message and ad- 
justs the shadow page vertical and horizontal scroll offsets 
to match the real page. 


Next we have to ensure content on the shadow page oc- 
cupies the same location and extent as the corresponding 
content on the real page. For example, consider the in- 
line text ad (Figure 1, #3), which highlights keywords and 
makes a pop-up appear near a keyword when the user’s 
mouse hovers over it. The precise location of the key- 
word depends on many things, such as the absolute co- 
ordinates of the element containing the text, height and 
width of the container element, font size of the text, di- 
mensions / layout of other content in the container, and 
more. We synchronize these details by sending the abso- 
lute position, size and computed style of each mirrored el- 
ement to the shadow page via the SetStyle() message. 
On the shadow page we apply these properties to content
elements, while keeping a record that these are not
“authentic” properties that should be synchronized back to the real
page during any future content mirroring operations.


This strategy works very well in practice but is not per- 
fect. For instance, there may be text in the real page that 
flows around an image. If the policy in effect for the text 
content allows read access, and the image is not readable, 
then the image will not appear on the shadow page and 
thus the text will not flow in the same way. To resolve 
issues due to the layout becoming out of sync, the
publisher can either make the image readable or customize
the shadow page to more accurately reflect the real page.


5 Evaluation


We evaluated ADJAIL to assess performance in three major
areas. In §5.1 we investigate the compatibility of our
architecture with six popular ad networks, each of which
serves a variety of ads. The security of our approach is
tested in §5.2. We then measure ad display latencies in
§5.3. Although many ad networks exist that were not
tested, we believe the relatively small sample we
evaluated offers good insights into the compatibility and
performance of ADJAIL.


5.1 Compatibility 


To evaluate how well ADJAIL works with existing ad 
scripts, we tested it on six popular ad networks: Yahoo! 
Network, Google AdSense, Microsoft Media Network, 
Federated Media Publishing, AdBrite and Clicksor. The 
first four used banner ads, while the latter two employed 
more complicated inline text ads. Yahoo!, Google and
Microsoft were three of the top ten ad networks in terms of
U.S. market reach in April 2009. With a total audience 
size of 192.8 million, Yahoo! reached 86.6% of the mar- 
ket, Google reached 85.3%, and Microsoft reached 72.4% 
[3]. 

Federated Media, AdBrite and Clicksor rank lower in 
terms of U.S. market reach (e.g., AdBrite ranked #21 
with a reach of 47.2%), but were chosen as they repre- 
sent the small publisher market and demonstrate unique 
functionality. They are not as pervasive, therefore they 
are more likely to exhibit compatibility problems and less 
tested features. In our experiments we focused on the fol- 
lowing observations: whether the ad functioned correctly, 
the minimum permissions required to support the ad, and 
whether click and impression counts were affected by our 
approach. 

Our prototype ADJAIL implementation is a sufficient 
proof-of-concept to demonstrate the feasibility of our ap- 
proach. The prototype is designed and tested to work on 
recent releases of the Chrome, Firefox, Internet Explorer, 
Opera and Safari web browsers. It does not yet have the 
level of refinement that would be present in a production 
system, which exposes some compatibility limitations we 
describe below. 


Correct functionality To evaluate correct functionality 
we embedded ad scripts from each ad network in a series 
of ADJAIL test pages, then compared the user experience 
to the same ad scripts when used without sandboxing. The 
four banner ad scripts (Yahoo!, Google, Microsoft and 
Federated) all made use of the default ad zone feature. In 
this experiment we observed two main types of ad banner: 
animated image and Flash. 

All of the banner ads rendered on the real page with- 
out any noticeable differences from rendering the ad with- 
out ADJAIL. Interacting with Flash ads via the mouse 
and clicking on banners worked exactly the same as the
non-sandboxed ads. One minor issue we are aware of
is that the contextual targeting approach used by Google 
AdSense does not work with our current implementation. 
This is because AdSense performs contextual targeting 
on the server, using an offline cached copy of the pub- 
lisher’s page. This limitation can be overcome by pro- 
viding pre-computed shadow pages to ad networks who 
perform server-side contextual targeting, like AdSense. 

For each of the inline text ad scripts (AdBrite and Click- 
sor), we annotated a news article with a full read and 
write access policy. The ad scripts identified keywords in 
the article and transformed them into interactive ads that 
“pop up” when the user hovers the mouse cursor over a 
keyword. This allowed us to evaluate the intricate syn- 
chronization capabilities of our architecture, such as ad 
script modifying existing page content and event forward- 
ing. The pop-ups consisted of a decorative window border 
around the actual advertisement. AdBrite worked well in 
this experiment; its ads were simply <iframe>s wrapped 
by the decorative border. Clicksor also worked without 
any noticeable differences. 


Minimum permissions For each tested ad network, we 
enabled the strictest set of permissions that would per- 
mit ads to function without impairment. These permis- 
sions are summarized in Table 3. To arrive at the set of 
permissions, we started with the base read and write ac- 
cess needed by the ad. We then enabled support in the 
content whitelist based on the needs of the ad. Finally, 
for fixed-size banner ads we set the maximum width and 
height policies. 

Google AdSense was configured to serve text ads, so 
we were hoping to confine it with a strict text-only pol- 
icy. Unfortunately the text ads were contained in an 
<iframe>, thus we had to set the enable-iframe per- 
mission. 

AdBrite and Clicksor needed append write permission 
on the <body> element to create their pop-ups. White- 
list customization was also required for the pop-ups, as 
they contained custom HTML elements to prevent inher- 
itance of publishers’ CSS formatting rules [4]. AdBrite 
was easier to support as we only had to whitelist their cus- 
tom <ispan> element. Clicksor used a randomly gen- 
erated element tag name consisting of the word “span” 
followed by digits (e.g., <span40110>). To accommodate
Clicksor we modified the whitelist to accept element
tag names that matched the JavaScript regular expression
/^span[0-9]{5,7}$/. Also we note that Clicksor was
the only ad network to require <form> and <input> ele- 
ments in its whitelist. 
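
A whitelist check along these lines is sketched below, assuming the
pattern begins with the start anchor ^ so that the whole tag name must
match. The contents of STATIC_TAGS and the function name are hypothetical.

```javascript
// Hypothetical fixed whitelist, including AdBrite's custom <ispan> element.
const STATIC_TAGS = new Set(["div", "span", "a", "img", "ispan"]);

// Clicksor's generated tag names: "span" followed by five to seven digits.
const CLICKSOR_TAG = /^span[0-9]{5,7}$/;

function isWhitelistedTag(tagName) {
  const name = tagName.toLowerCase();
  return STATIC_TAGS.has(name) || CLICKSOR_TAG.test(name);
}

for (const tag of ["span40110", "span123", "ISPAN", "script"]) {
  console.log(tag, isWhitelistedTag(tag));
}
```

Anchoring both ends of the pattern matters: without the anchors, a tag
name that merely contains a matching substring would also be accepted.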


Click and impression counts To measure the number of 
clicks and impressions caused by ads, we configured our 
browser to route all traffic through a web proxy running 
the Squid proxy software. We rendered each ad script with 
and without sandboxing, and clicked on the displayed ads
in each case. For this experiment, the web page hosting
the ad script was completely blank except for a single 
paragraph of text, which was used for rendering inline text 
ads and contextual ad targeting. 

A given ad script may show a different ad each time 
it is rendered. To ensure consistency in our evaluation, 
multiple renderings were sometimes performed for an ad 
network to ensure we clicked on the same advertisement 
with and without sandboxing. In between renderings, we 
cleared the browser’s cache to ensure proxy access pat- 
terns were not affected by prior tests. 

After performing the experiment, we analyzed the 
proxy’s access logs. We discarded all log entries that re- 
ferred back to our server hosting the test pages and AD- 
JAIL source code. Comparing the remaining log entries, 
we did not find any differences in the HTTP requests gen- 
erated by sandboxed versus non-sandboxed ads. Thus we 
conclude that in our experiment, ads using our sandbox 
environment did not impose any additional impressions 
or generate any additional clicks, thereby preserving traf- 
fic patterns crucial to the web advertising revenue model. 


5.2 Security 


To evaluate the security provided by ADJAIL we in- 
stalled the RoundCube webmail v0.3.1 software on our 
web server. We integrated two ad network scripts on the 
main webmail interface: one ad script was included di- 
rectly on the page, and the other was embedded using 
ADJAIL. A single trial consisted of replacing each of 
the two ad scripts with a malicious script designed to per- 
form one specific attack or policy violation. We then ob- 
served if the malicious script functioned correctly in the 
non-sandboxed location, and whether the attack was pre- 
vented in the sandboxed location. Several trials were con- 
ducted to assess different attack vectors, and to determine 
the least restrictive policy required to defend each vector. 

Our experiments were designed to support our claims in
§1 of strong defense against several potent attack vectors
to which ad publishers are routinely exposed. However, 
we did not evaluate the threats discussed in §2 that are be- 
yond the scope of our current work: drive-by downloads, 
Flash exploits, privacy attacks, covert channels, and frame 
busting. 

Results of the security evaluation are included on the 
right side in Table 3. With appropriate policies in ef- 
fect, ADJAIL blocked all of the in-scope threats. We note 
that for each ad, write access was allowed for the subtree 
rooted at the <div> element designated for ad content. 
However, every ad policy denied write access (the default 
setting) for the rest of the document. A degree of leniency 
is required in our policies for compatibility with existing 
ads, which opens the door to some of the secondary
attacks. However, every ad network we tested was protected
from our primary threats: confidential data leaks and
content integrity violations.


Computed Policy (Annotated policy in bold)

Ad Network       Element        read     write    enable  enable  enable  max    max     overflow  Attack resistance
                                access   access   images  iframe  flash   width  height            (EX CB IV CJ UI AP OA)
AdBrite          <body>         none     append   allow   allow   deny    none   none    deny      VV VV
                 Article <div>  subtree  subtree  deny    deny    deny    none   none    deny      Vv VV
Clicksor         <body>         none     append   allow   deny    deny    none   none    deny      VV VV
                 Article <div>  subtree  subtree  deny    deny    deny    none   none    deny      Vv VV vo
Federated Media  Ad <div>       none     subtree  allow   allow   allow   90px   728px   deny      VV Vv Vv
                 Rest of page   none     none     deny    deny    deny    none   none    deny      V¥VVVVVVV
Google           Ad <div>       none     subtree  deny    allow   deny    600px  160px   deny      VV Vv Vv
                 Rest of page   none     none     deny    deny    deny    none   none    deny      V¥VVVVVVV
Microsoft        Ad <div>       none     subtree  deny    allow   allow   300px  250px   deny      VV Vv Vv
                 Rest of page   none     none     deny    deny    deny    none   none    deny      V¥VVVVVVV
Yahoo!           Ad <div>       none     subtree  deny    deny    allow   90px   780px   deny      VV Vv Vv
                 Rest of page   none     none     deny    deny    deny    none   none    deny      V¥VVVVVVV

Table 3: Policy annotations required to support several popular ad networks, and attacks prevented in policy enforcement regions.
Attacks prevented are: EX: Execute arbitrary code in context of real page (non-XSS), CB: Data confidentiality breach, IV: Content
integrity violation, CJ: Clickjacking, UI: UI spoofing, AP: Arbitrary ad position, OA: Oversized ad. Default link-target policy
used for all.


Below we briefly describe our objectives and method- 
ology for testing each attack. 


Execute arbitrary code in context of real page In this 
attack we attempted to break out of the sandbox, by caus- 
ing the browser to execute ad script code in context of the 
real page. This attack is critical because, if successful, 
malicious code can disable all policy enforcement logic in 
the real page and subsequently mount any of the other
attacks. Specifically excluded from this vector is code
injection by reflected, DOM-based, and stored XSS attacks, which
the web application can defend by other means. 

We attempted to inject script code in the real page via 
DOM traversal, but this was blocked by the browser’s
same-origin policy (SOP). Next, we evaluated 7 different real-world
attacks sourced from the XSS Cheat Sheet [16]. Each
attack demonstrated a unique code injection vector, such as
embedded script element, event handler, javascript: 
URI, CSS expression, and more. These code injection at- 
tempts were blocked by enforcing a no-script policy on 
content models when constructing the mirrored ad in the 
real page, using the technique we developed in prior work 
[46]. 

To evaluate our defense against Flash-based script in- 
jection attacks, we created a Flash application that uses 
the ExternalInterface API to extract confidential 
data from the DOM. Flash regulates access to this API
via the allowScriptAccess attribute of <object>
elements, and value attribute of <param> elements when
the name attribute is set to allowScriptAccess.
Without ADJAIL, the ad network’s script can create Flash
objects on the real page with allowScriptAccess
set to always. This setting permits Flash
ActionScript code to fully access the real page’s JavaScript
environment, including sensitive page content via the
DOM. Our defense blocks this attack vector by
forcing the allowScriptAccess attribute to never on
all <object> elements and relevant <param>
elements. This action effectively disables the Flash
ExternalInterface API.
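
Forcing the attribute on a content model might be sketched as follows. The
model shape and the function name forceNoScriptAccess are assumptions; the
sketch only shows the rewriting rule described above, applied recursively.

```javascript
const ATTR = "allowscriptaccess"; // compared case-insensitively

// Return a copy of the model with Flash script access forced to "never".
function forceNoScriptAccess(model) {
  const tag = model.tag.toLowerCase();
  const attributes = { ...(model.attributes || {}) };

  if (tag === "object") {
    // Remove any differently-cased variant before setting the forced value.
    for (const key of Object.keys(attributes)) {
      if (key.toLowerCase() === ATTR) delete attributes[key];
    }
    attributes.allowScriptAccess = "never";
  } else if (
    tag === "param" &&
    String(attributes.name || "").toLowerCase() === ATTR
  ) {
    attributes.value = "never";
  }

  const children = (model.children || []).map(forceNoScriptAccess);
  return { tag: model.tag, attributes, children };
}

// Hypothetical model of a Flash object created by an ad script.
const flashModel = {
  tag: "object",
  attributes: { allowscriptaccess: "always", data: "http://ads.example/ad.swf" },
  children: [
    { tag: "param", attributes: { name: "AllowScriptAccess", value: "always" } },
    { tag: "param", attributes: { name: "quality", value: "high" } },
  ],
};

const safeModel = forceNoScriptAccess(flashModel);
console.log(JSON.stringify(safeModel, null, 2));
```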

All script injection attacks were prevented even with the 
most permissive policy that can be written using our pol- 
icy language. Thus the script injection vector is defended 
for every possible policy configuration. 


Confidential information leak For this attack we re- 
trieved two items of confidential data from the real page: 
the user’s session cookie and list of email contacts. Due 
to SOP restrictions, the sandboxed attack could not ac- 
cess the information by DOM traversal. (We note DOM 
traversal is also an ineffective strategy for all remaining 
evaluated attacks.) The only way the attack could access 
confidential data was when the data was given a policy
granting full read access.


Figure 7: Rendering latencies: (a) time spent loading the ad, and (b) time from start of page load until ad appears.


Content integrity violation This attack tampers with 
trusted content on the real page: the user’s email mes- 
sage headers. Specifically the attack makes all messages 
appear to be sent by prominent government officials. The 
sandboxed attack was unsuccessful except when the mes- 
sage headers were given a policy with full write access. 


Clickjacking The clickjacking attack attempts to entice 
the user to unknowingly click on an <iframe> element. 
The attack script is based on detailed technical analy- 
sis of the vector [17, 54]. With a policy that disallows 
<iframe> elements, the sandboxed attack was unsuc- 
cessful because the policy prevents any <iframe> on the 
(hidden) shadow page from being brought up to the real 
page where the user can click it. Since any <iframe> 
embedded by the ad is unclickable to the end user, typi- 
cal tricks to mask the clickjacking attack (e.g., hiding the 
<iframe> using transparency) are not a factor. 


User interface spoofing We made an ad appear identical
to trusted webmail user interface components in an
attempt to lure users into interacting with the ad (i.e., an
interface spoofing attack [26]). This attack was defeated by
denying images, <iframe>s and Flash, and further
constraining the ad with policies that disallow the ad from
overlapping other parts of the trusted interface. Since the
ad can still make use of textual elements, we note there
exists a very small likelihood for an attacker to succeed
through a very nuanced UI spoofing attack using very small
(single pixel) elements or text, such that images can be
rendered in HTML one pixel at a time. Mitigating this
threat may require advanced analysis of ad content or re- 
stricting the color palette available to ads. 


Arbitrary ad position We made an ad appear on the 
real page outside of its write-accessible container element. 
This type of violation can be performed by setting an ad 
content display position that is outside the bounds of its 
container. With a policy that denies overflow, violations
due to out-of-bounds display positioning are blocked.
Position policies can also be violated by a node splitting
attack, which may only succeed when there is no
mechanism to provide hypertext markup isolation [41, 45]. Our
content mirroring approach provides the necessary
isolation by default to prevent node splitting attacks.


Oversized ad We made an ad larger than the publisher’s 
expected ad size. The size violation was blocked by con- 
figuring a policy to limit the maximum height and width, 
and disallowing overflow. 


5.3 Rendering overhead


To measure ad rendering latencies incurred by our policy 
enforcement mechanism, we placed each ad script on a 
typical blog page instrumented with benchmarking code. 
There were a total of 12 instances of the blog page: for
each of the six ad networks evaluated in §5.1, one version
of the blog page used the original ad, and a second
version used ADJAIL to enforce the policies in Table 3. As
the blog page is rendered, the ad script executes and scans
for contextual data, requests a relevant ad from the ad
network based on this data, and finally renders the ad. This
experiment reflects the typical delays an end user would
experience when browsing publisher pages that integrate
ADJAIL.


The test pages were rendered in Firefox v3.6.3 on 
an AMD Phenom X4 940 (3.0 GHz) workstation with
7.9 GB RAM. To resemble a typical browsing
environment, the browser cache was enabled during the
experiment. Each test page includes a link to our ADJAIL
implementation source code (102 kB of JavaScript), which was
cached by the web browser. The code is not optimized 
for space and contains much debug code. The memory 
overhead required by ADJAIL was reasonably consistent 
across ad networks, averaging 5.52% or roughly 3.06 MB. 


Results of this experiment are shown in Figure 7. First 
we measured the time taken to render only the ad (Fig- 
ure 7a). For AdBrite and Clicksor (inline text ads), this 
measurement consists of the time between the user
triggering an ad pop-up and appearance of the pop-up.
Although we do not separately report the latency incurred
by forwarding events to the shadow page (ref. §4.3.3),
this overhead is included in Figure 7. For this
experiment, we stopped the benchmark after the ad’s <iframe>
or <object> onload event was triggered, signaling the
ad was complete. Without sandboxing, ads rendered in
374 ms on average. With ADJAIL, ad rendering averaged
532 ms, an additional latency of 158 ms.


19th USENIX Security Symposium


To better understand the impact of ad rendering latency, 
we measured the time between when the page started 
loading until the ad completed rendering (Figure 7b). This 
is an important benchmark for ads, as many ad networks 
use a content distribution network (CDN) to improve
performance in this regard [47]. For AdBrite and Clicksor,
we measured the time until inline text links finished ren- 
dering, although no ads are visible until the user triggers 
a pop-up. Without sandboxing, ads appear in 489 ms on 
average after the page begins to load. With ADJAIL, an 
additional 163 ms delay was incurred on average. 
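
The relative overheads implied by these averages can be worked out
directly. The sketch below uses only the figures reported in this section;
the variable and function names are ours, and the percentages are derived
values rather than numbers reported in the evaluation.

```javascript
// Averages reported for Figure 7a (ad only) and Figure 7b (page load to ad).
const adOnly = { baselineMs: 374, addedMs: 158 };
const pageToAd = { baselineMs: 489, addedMs: 163 };

// Added latency as a fraction of the non-sandboxed baseline.
function relativeOverhead({ baselineMs, addedMs }) {
  return addedMs / baselineMs;
}

function summarize(label, m) {
  const total = m.baselineMs + m.addedMs;
  const percent = (relativeOverhead(m) * 100).toFixed(1);
  return `${label}: ${total} ms with sandboxing (+${percent}%)`;
}

console.log(summarize("ad only", adOnly));
console.log(summarize("page load to ad", pageToAd));
```

Measured against the full page load, the added delay is a smaller fraction
of the total than when the ad is measured in isolation.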


Optimizing performance is an important area for fu- 
ture work. A straightforward way to improve perfor- 
mance will be to optimize our prototype implementation. 
More significant gains may be achieved by adapting our 
approach to support pre-computing policies and shadow 
pages. It may be feasible to integrate caching of poli- 
cies and shadow pages into web application templates and 
frameworks, to allow better performance without raising 
the publisher effort required to deploy ADJAIL. 


6 Conclusion 


In this paper, we presented ADJAIL, a solution for the 
problem of confinement of third-party advertisements to 
prevent attacks on confidentiality and integrity. A key
benefit of ADJAIL is compatibility with the existing web 
usage models, requiring no changes to ad networks or 
browsers employed by end users. Our approach offers 
publishers a promising near term solution until web stan- 
dards support for confinement of advertisements evolves 
to offer solutions agreeable to all parties.


Abstract

One of the main obstacles for the wider deployment 
of radio (RF) distance bounding is the lack of plat- 
forms that implement these protocols. We address 
this problem and we build a prototype system that 
demonstrates that radio distance bounding protocols 
can be implemented to match the strict processing 
that these protocols require. Our system implements 
a prover that is able to receive, process and transmit 
signals in less than 1 ns. The security guarantee that
a distance bounding protocol built on top of this
system therefore provides is that a malicious prover can,
at most, pretend to be about 15 cm closer to the
verifier than it really is. To enable such fast processing
at the prover, we use specially implemented concate- 
nation as the prover’s processing function and show 
how it can be integrated into a distance bounding 
protocol. Finally, we show that functions such as 
XOR and the comparison function, that were used in 
a number of previously proposed distance bounding 
protocols, are not best suited for the implementation 
of radio distance bounding. 


1 Introduction 


Distance bounding denotes a class of protocols in 
which one entity (the verifier) measures an upper- 
bound on its distance to another (untrusted) entity 
(the prover). In recent years, distance bounding protocols
have been extensively studied: a number of
protocols were proposed [3, 13, 10, 19, 30, 15, 25,
17, 12, 29] and analyzed [8, 26, 11, 23]. The use of
distance bounding was suggested for secure localiza- 
tion [28], location verification [25], wormhole detec- 
tion [16, 27], key establishment [22, 32] and access 
control [22]. 

Regardless of the type of distance bounding protocol,
the distance bound is obtained from a rapid
exchange of messages between the verifier and the
prover. The verifier sends a challenge to the prover,
to which the prover replies after some processing
time. The verifier measures the round-trip time between
sending its challenge and receiving the reply
from the prover, subtracts the prover’s processing
time and, based on the remaining time, computes
the distance bound between the devices. The verifier’s
challenges are unpredictable to the prover and
the prover’s replies are computed as a function of
these challenges. In most distance bounding protocols,
a prover XORs the received challenge with a
locally stored value [3] or uses the received challenge
to determine which of the locally stored values it will
return [13, 29]. Thus, the prover cannot reply to the
verifier sooner than it receives the challenge; it can
only delay its reply. The prover, therefore, cannot
pretend to be closer to the verifier than it really is;
only further away.


One of the main assumptions on which the secu- 
rity of distance bounding protocols relies is that the 
time that the prover spends in processing the veri- 
fier’s challenge is negligible compared to the propa- 
gation time of the signal between the prover and the 
verifier. If the verifier overestimates the prover’s pro- 
cessing time (i.e., the prover is able to process signals 
in a shorter time than expected), the prover will be 
able to pretend to be closer to the verifier. If the ver- 
ifier underestimates this time (i.e., the prover needs 
more time to process the signals than expected), the 
computed distance bounds will be too large to be 
useful. 


The challenge in implementing distance bounding
protocols is therefore to implement a prover that is
able to receive, process and transmit signals in negligible
time. This requirement can be easily met with
ultrasonic distance bounding implementations where
the prover’s processing needs to be on the order of
µs. However, because ultrasonic distance bounding
is vulnerable to RF wormhole attacks [16, 27],
its application is limited to a few specific applications
(e.g., [22]). For most applications, radio distance
bounding is the main viable way of verifying proximity
to or a location of a device. In this case, the
prover’s processing time needs to be about 1ns which
would, in the worst case, allow a malicious prover
to pretend to be closer to the verifier by approx.
15cm (assuming that the malicious prover is able to 
process signals instantaneously). Currently available 
platforms do not support such fast processing. This 
strict processing requirement has been, so far, one of 
the main obstacles for the wider deployment of RF 
distance bounding protocols and related solutions. 


In this work, we address this problem. We make 
the following contributions. We build a prototype 
system that demonstrates that radio (RF) distance 
bounding protocols can be implemented to match 
the prover’s strict processing requirements (i.e., that 
the prover’s processing time is below 1ns). We use
concatenation as the prover’s processing function 
and implement it using a scheme that we call Chal- 
lenge Reflection with Channel Selection (CRCS). 
Our implementation eliminates the need for signal 
conversion and demodulation since it does not re- 
quire that the received challenges are interpreted by 
the prover before the prover responds to them. Our 
prover is therefore able to receive, process and transmit
signals in less than 1ns. We design a distance
bounding protocol that uses concatenation, imple- 
mented with CRCS, as the prover’s processing func- 
tion and we analyze its security; we base this proto- 
col on Brands and Chaum’s original distance bound- 
ing protocol [3]. 

We further show that processing functions such as 
XOR and the comparison function, that were used 
in a number of proposed distance bounding proto- 
cols, are not best suited for the implementation of 
radio distance bounding. The main reason is that, 
although XOR and comparison can be executed fast, 
these functions require that the radio signal that car- 
ries the verifier’s challenge is demodulated, which, 
with today’s state-of-the-art hardware, results in 
long processing times (typically > 50ns). The de- 
sign and implementation of the distance bounding 
protocol based on concatenation shows that the use 
of functions which require that the prover demod- 
ulates (interprets) the verifier’s challenge before re- 
sponding to it is not necessary for the implementa- 
tion of distance bounding. 

To our knowledge this work is the first to propose 
a realizable distance bounding protocol using radio 
communication, with a processing time at the prover 
that is low enough to provide a useful distance granularity.

The rest of the paper is organized as follows. In
Section 2 we describe the basic operation of distance 
bounding protocols. In Section 3, we discuss prover’s 
processing functions and their appropriateness for 
the implementation of radio distance bounding. In 
Section 4 we describe the design of our distance 
bounding protocol and in Section 5 we analyze its 
security. In Section 6 we present our implementa- 
tion and our measurement results. In Section 7 we 
discuss related work and we conclude in Section 8. 


2 Background on Distance Bounding 
Protocols 


Distance bounding protocols were first introduced 
by Brands and Chaum [3] for the prevention of 
mafia-fraud attacks on Automatic Teller Machines 
(ATMs). The purpose of Brands and Chaum’s dis- 
tance bounding protocol was to enable the user’s 
smart-card (verifier) to check its proximity to the 
legitimate ATM machine (prover). 

The core of all distance bounding protocols is the 
distance measurement phase (shown in Figure 1), 
wherein the verifier measures the round-trip time 
between sending its challenge and receiving the reply
from the prover. More precisely, the verifier
challenges the prover with a b-bit freshly generated
nonce $N_v$ (typically b = 1). Upon reception of the
challenge, the prover computes a response $f(N_v)$,
and sends it to the verifier. This process is repeated
k times. After the challenge-response exchange the
verifier verifies the authenticity of the replies (in this
step distance bounding protocols differ) and measures
the time $t_r^V - t_s^V$ between the challenge and
the response (superscripts V and P denote the verifier
and the prover; subscripts s and r denote sending and
receiving). Based on the measured times, the verifier
estimates the upper-bound on the distance to
the prover. The time $t_s^P - t_r^P$ between the reception
of the challenge and the transmission of the response
at the prover is either negligible compared to
the propagation time $t_r^P - t_s^V$ or is lower bounded by
the prover’s processing and communication capabilities
$\delta$, i.e., $t_s^P - t_r^P \geq \delta$.

After the execution of a distance bounding protocol
the verifier knows that the prover is within a
certain distance, namely:

$$d \leq \frac{t_r^V - t_s^V - \delta}{2} \cdot c,$$

where $\delta$ is the processing time of the prover (ideally
0) and c is the propagation speed of the radio signal.
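
As a rough illustration of the bound above, the following Python sketch computes the upper bound from the verifier's send and receive timestamps. The function name `distance_bound`, the use of seconds as the time unit and the vacuum speed of light as the propagation speed are assumptions made for this example; they are not part of the protocol description.

```python
# Illustrative sketch only: names and units are assumptions for this example.
SPEED_OF_LIGHT = 299_792_458.0  # metres per second (propagation speed c)


def distance_bound(t_send: float, t_receive: float, delta: float = 0.0) -> float:
    """Upper bound on the verifier-prover distance, in metres.

    t_send    -- time at which the verifier sent the challenge (seconds)
    t_receive -- time at which the verifier received the reply (seconds)
    delta     -- processing time the verifier attributes to the prover (seconds)
    """
    round_trip = t_receive - t_send - delta
    if round_trip < 0:
        raise ValueError("attributed processing time exceeds the measured round trip")
    return round_trip / 2.0 * SPEED_OF_LIGHT


# A 100 ns round trip with no processing delay bounds the prover to roughly 15 m.
bound_m = distance_bound(0.0, 100e-9)
```

Subtracting a larger `delta` yields a smaller bound, which is why the verifier must not attribute more processing time to the prover than the prover actually needs.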

Figure 1: The distance measurement phase of distance bounding protocols consists of a rapid exchange of
messages where the verifier measures the round-trip time between sending its challenges and receiving the
replies from the prover.

Although the designs of distance bounding protocols
differ [3, 13, 10, 19, 30, 15, 25, 17, 12, 29],
given their common distance measurement phase,
their security relies on the same underlying ideas.
We briefly summarize them here. Distance fraud attacks
[3], in which the prover tries to pretend to be
closer to the verifier, are prevented by the following:
(i) the prover cannot generate the reply before
it receives the challenge and (ii) the processing time
that the verifier attributes to the prover is not longer
than the prover’s actual processing time. The Mafia-fraud
(or man-in-the-middle, MITM) attack [9], by which an
attacker convinces the verifier that the prover is closer
than it really is, is prevented since the attacker cannot
predict the exchanged challenges/replies and since it
cannot speed up the propagation of messages (the
messages propagate at the speed of light over a radio channel).
Given this, the attacker cannot shorten the distance
measured between the verifier and the prover.

Distance bounding protocols therefore provide the 
verifier with an upper-bound on its physical distance 
to the prover. 


3 Functions Appropriate for Distance 
Bounding Realization 


As discussed in Section 2, one of the main assump- 
tions on which the security of distance bounding 
protocols relies is that the time that the prover is 
allowed to spend in processing the verifier’s challenge
is negligible compared to the propagation time
$t_r^P - t_s^V$ of the signal between the prover and the verifier.
In most applications, the prover’s processing
time would therefore need to be around 1ns. This
would, in the worst case, allow a malicious prover to
pretend to be closer to the verifier by approx. 15cm
(assuming that the malicious prover is able to process
signals instantaneously). Such short processing
time cannot be achieved with existing platforms.

The main challenge is therefore to design distance
bounding protocols which use prover processing
functions $f(N_v)$ that can be implemented such that
they can be executed in < 1ns. Before presenting a
function that is well suited for this purpose, we first 
discuss functions that were used in distance bound- 
ing protocols that are proposed in the open litera- 
ture. 


The first (obvious) candidate processing functions 
are various encryption functions, hash functions, 
message authentication codes and digital signatures; 
the use of digital signatures for this purpose was pro- 
posed by Beth and Desmedt in [1]. The use of such 
functions would largely simplify the design of dis- 
tance bounding protocols; it would be sufficient to 
use well studied challenge-response authentication 
protocols [2] where the verifier would measure the 
round-trip time between the issued challenge and the 
received response. However, the processing time for 
these functions even with the fastest available im- 
plementations by far exceeds the required processing 
time. 


In [3] Brands and Chaum proposed a distance 
bounding protocol that uses XOR as a processing 
function. In this protocol the prover XORs the verifier’s
challenge with the value that the prover wants
to transmit back and sends the result back to the
verifier. The main reasoning behind this choice was
that XOR is a fast operation and that it should be 
feasible to execute it within the required process- 
ing time. Hancke and Kuhn [13] propose a distance 
bounding protocol where the prover, based on the 
verifier’s challenge chooses from which of the two lo- 
cal registers it should send a value back. Again, one 
of the main reasons for choosing this function was
that such a function (comparison and access) can be
executed fast.

Although XOR and comparison can be executed
fast, these functions require that the radio signal
that carries the verifier’s challenge is converted from
an analog to a digital signal (ADC) and demodulated.
Only when it is demodulated can the challenge
be used by the prover in an XOR function or
for the selection of the register. Equally, in order
to communicate the reply back to the verifier,
the prover needs to modulate the signal and convert
it from a digital to an analog signal (DAC).
These steps (signal detection, ADC/DAC conversion
and signal modulation/demodulation) increase the
prover’s processing delay by approx. 170ns [24], not
including possible RX/TX switching costs (footnote 1). The
implementations of an XOR or of a comparison function
that require the signals to be digitized and demodulated
therefore require such processing which,
using today’s state-of-the-art hardware, is not sufficiently
fast to meet the security requirements of
distance bounding protocols. Even if some processing
steps can be sped up or removed, the prover will
still need a way of (reliably) detecting if it received
a challenge that corresponds to a bit "0" or a bit
"1", which requires some processing and thus reduces
the security guarantees of the protocol. Namely,
every nanosecond of additional processing in the
implementation of the prover means that a malicious
prover with a faster implementation can shorten
the measured distance even further.
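
To make the cost of each nanosecond concrete, the following sketch converts a processing delay into the distance by which a prover that processes instantaneously could appear closer. The function name and the vacuum speed of light are assumptions for this illustration; the two delays are the ones mentioned in the text.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def max_distance_shortening(processing_delay_s: float) -> float:
    """Distance (metres) by which a prover that processes instantaneously
    could appear closer when the verifier subtracts processing_delay_s."""
    return processing_delay_s * SPEED_OF_LIGHT / 2.0


# Delays mentioned in the text: about 1 ns (the target processing time) and
# about 170 ns (signal detection, ADC/DAC conversion, modulation/demodulation).
for delay_ns in (1, 170):
    metres = max_distance_shortening(delay_ns * 1e-9)
    print(f"{delay_ns:>4} ns -> {metres:.2f} m")
```

Under these assumptions a 1ns delay corresponds to roughly 15cm, while a 170ns delay corresponds to roughly 25m, which is too coarse for most proximity checks.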

In what follows, we show that the choice of a con- 
catenation function as the prover’s processing func- 
tion, when implemented using a scheme that we call 
Challenge Reflection with Channel Selection (CRCS) 
eliminates the need for signal conversion and demod- 
ulation since it does not require that the received 
challenges are interpreted by the prover before the 
prover responds to them. The prover, implemented
using CRCS, is therefore able to receive, process and
transmit signals in less than 1ns.


3.1 Prover: Concatenation Implemented using
Challenge Reflection With Channel Selection


In this section we describe our implementation of
concatenation as the prover’s processing function.
Bit concatenation $\mathrm{CAT}: N_v[i] \times N_p[i] \rightarrow r[i] = N_v[i] \| N_p[i]$
takes as input the verifier’s challenge bit
$N_v[i]$ and the prover’s input bit $N_p[i]$ and returns a
two-bit reply $r[i] = N_v[i] \| N_p[i]$. CAT is therefore
given by the following table.

N_v[i]   N_p[i]   r[i]
  0        0       00
  0        1       01
  1        0       10
  1        1       11

Footnote 1: We are not aware of a radio design that can perform
these operations faster.

Figure 2: The verifier measures the time between
sending a challenge signal $c(t)$ and receiving the reply
signal $r(t) = r_1(t) + r_2(t)$. If $c(t) = r(t)$, the distance
bound to the prover is then given by $(t_1 - t_0) \cdot c$,
where c is the speed of light.

3.2 Verifier: Calculation of the Distance Bound


In order for concatenation to be useful for distance
bounding, we implement it by Challenge Reflection
with Channel Selection. Our implementation
uses three (non-overlapping) communication
channels. The verifier sends its challenge bits to
the prover using one communication channel ($C_0$),
whereas the prover replies using two communication
channels ($C_1$, $C_2$) (Figure 2). While it is receiving
the verifier’s challenge bit (i.e., the signal that encodes
it), the prover is responding with the same
signal (bit), but it is sending it on either channel $C_1$
or channel $C_2$, depending on its current input bit
$N_p[i]$. For every challenge bit that it receives from
the verifier, the prover therefore transmits two bits
of the reply back to the verifier, encoded in the form
of the signal (it reflects back the same signal that it
received) and of the response channel (it chooses the
channel on which to reply). The response r = 10 is
then interpreted as: the challenge bit 1 is reflected
on channel $C_1$ (where channel $C_1$ denotes bit 0
and channel $C_2$ denotes bit 1). The prover therefore
implements challenge reflection with channel selection.
Note that, although the prover replies with two
bits for each challenge bit, the duration of transmission
of those two bits is the same as for a single
bit of the verifier’s challenge, since the second bit of
the prover’s reply is encoded in the form of channel
selection. This is illustrated in Figure 2.
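
The channel-selection encoding described above can be modelled at the bit level as follows. This is only a sketch: the real prover operates on analog signals and never interprets the challenge bit, and the names `crcs_reflect` and `decode_reply` as well as the string labels for the channels are assumptions for this example.

```python
# Illustrative bit-level model of CRCS; the actual prover is an analog circuit.
CHANNEL_FOR_PROVER_BIT = {0: "C1", 1: "C2"}   # channel C1 denotes bit 0, C2 bit 1
PROVER_BIT_FOR_CHANNEL = {"C1": 0, "C2": 1}


def crcs_reflect(challenge_bit: int, prover_bit: int) -> tuple[str, int]:
    """Prover side: reflect the challenge bit unchanged on the selected channel."""
    return CHANNEL_FOR_PROVER_BIT[prover_bit], challenge_bit


def decode_reply(channel: str, reflected_bit: int) -> str:
    """Verifier side: read the two-bit reply (challenge bit, then prover bit)."""
    return f"{reflected_bit}{PROVER_BIT_FOR_CHANNEL[channel]}"


# The example from the text: challenge bit 1 reflected on channel C1 gives r = 10.
reply = decode_reply(*crcs_reflect(challenge_bit=1, prover_bit=0))
```

Only one signal is transmitted per challenge bit; the second bit of the reply is carried entirely by which channel that signal appears on.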

The schematic of our prover implementing CRCS
is shown in Figure 3. The figure shows the signal in
the frequency domain as it passes through various
stages of the prover’s circuit. The prover receives
the challenge-signal (centered at the frequency $f_c$)
on the receiving antenna. The received signal is then
multiplied by $f_\Delta$, which creates two signals on two
channels with central frequencies $f_c + f_\Delta$ and
$f_c - f_\Delta$, respectively. The current bit of the prover’s
nonce $N_p[i]$ determines which of the two channels is
used to send the response signal on the transmitting
antenna. The verifier’s signal is thus reflected back
on the channel selected by the prover. Here, the
verifier’s challenge bit can be encoded in the challenge
signal using e.g., Pulse Amplitude Modulation
(PAM) or Binary Phase Shift Keying Modulation
(both of which are used with Ultra-Wide-Band ranging
systems). The prover’s response carries two bits,
one encoded in the signal that it sends back (the
same bit that it received from the verifier), and the
other encoded in the channel on which it responds
(i.e., $N_p[i]$).

Figure 3: Schematic of the prover (i.e., of the implementation
of concatenation as its processing function
using CRCS). The figure shows the signal in
the frequency domain at various stages of the circuit.
The challenge-signal (with center frequency
$f_c$) is received by the receiving antenna (on the left)
and multiplied by $f_\Delta$. This multiplication shifts the
signal by $\pm f_\Delta$ to the channels on two sides of the
original channel. The bit of the prover’s nonce $N_p[i]$
determines which of the two channels is used to send
the response on the transmitting antenna (on the
right).

Here, signal multiplication and selection are done
using analog components only. Namely, the challenge
signal passes through an analog mixer where
it is multiplied with a local oscillator signal with a
frequency $f_\Delta$. This mixer outputs two signals on
frequencies $f_c + f_\Delta$ and $f_c - f_\Delta$, which are separated
by a high-pass and a low-pass filter, respectively. Finally,
the $N_p[i]$ bit (which the prover has committed
to) determines which of the two signals will be
transmitted back to the verifier.
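
The appearance of the two side channels follows from the product-to-sum identity. As a simplified illustration, assume the challenge is a pure carrier at the centre frequency and the local oscillator is a pure cosine at the offset frequency (both are idealizations introduced here, written $f_c$ and $f_\Delta$; a modulated challenge behaves the same way component by component):

```latex
\cos(2\pi f_c t)\,\cos(2\pi f_\Delta t)
  = \tfrac{1}{2}\cos\bigl(2\pi (f_c + f_\Delta)\,t\bigr)
  + \tfrac{1}{2}\cos\bigl(2\pi (f_c - f_\Delta)\,t\bigr)
```

The high-pass filter keeps the first term and the low-pass filter keeps the second, so each filter output is a copy of the challenge shifted to one of the two reply channels.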

Figure 2 shows the calculation of the distance
bound by the verifier (the signals are shown in the
time domain). The verifier notes the exact time
$t_0$ when it starts transmitting the challenge bits
$N_v[1], \ldots, N_v[k]$ encoded in the signal $c(t)$, and then
listens on the two reply channels $C_1$ and $C_2$ (that
correspond to the frequencies $f_c + f_\Delta$ and $f_c - f_\Delta$).
When a reply comes back (e.g., on channel $C_1$) the
verifier will mark the exact time $t_1$ of the arrival of
the signal. The verifier will then wait for the arrival
of the entire challenge, noting for every time slot on
which channel the reply was sent. After the entire
nonce has been received and processed by the radio,
the verifier checks that the data bits in the reply are
the same as those sent in the challenge, i.e., that
$c(t) = r_1(t) + r_2(t)$. If that is the case, the distance
bound is then computed as $(t_1 - t_0) \cdot c$, where c is the
speed of light. This bit comparison is important for
the security of our distance bounding protocol (as
we detail in Section 4); it can be efficiently done using
autocorrelation, which can then simultaneously
be used to calculate the time difference (e.g., as it is
used in GPS [20]).
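
A minimal sketch of correlation-based delay estimation is given below. It correlates the transmitted challenge with the received reply in discrete time, assuming a noise-free channel, one +1/-1 sample per bit and a hypothetical `estimate_delay` helper; a real receiver works on sampled RF waveforms and is considerably more involved.

```python
import random


def estimate_delay(sent: list[int], received: list[int]) -> int:
    """Return the lag (in samples) at which `received` best matches `sent`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(received) - len(sent) + 1):
        score = sum(s * received[lag + i] for i, s in enumerate(sent))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


rng = random.Random(1)
challenge = [rng.choice((-1, 1)) for _ in range(64)]   # +/-1 samples, one per bit
true_lag = 7
reply = [0] * true_lag + challenge + [0] * 10          # delayed, noise-free copy
lag = estimate_delay(challenge, reply)
```

The correlation peak confirms that the reflected data matches the challenge and locates its arrival time in the same pass; multiplying the lag by the sample period gives the measured time difference.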


4 Distance Bounding Realization 


In this section we present our distance bounding protocol
and its realization. The protocol uses concatenation
implemented using CRCS as the prover’s processing
function. The main security properties that
we want our protocol to achieve are resilience to distance
fraud and Mafia fraud attacks.

Our protocol is shown in Figure 4. It closely 
resembles the original protocol of Brands and 
Chaum [3], except that it does not use rapid bit ex- 
change, but instead uses full duplex communication 
with signal streams. XOR is replaced with the con- 
catenation function, and additional checks by the 
prover and the verifier are added to make sure the 
implementation of concatenation using CRCS does 
not introduce vulnerabilities. 

The prover starts the protocol by picking a fresh
(large) nonce $N_p$. The prover then sends a commitment
to the nonce (e.g., a signed hash of the nonce)
to the verifier. At this point, the prover already activates
its distance bounding hardware and sets the output
channel according to the opposite of the first bit of
the nonce $N_p$. From this moment, any signal that
the prover receives on channel $C_0$ will be reflected on
the output channel that is set. However, the prover
does not yet start switching between output channels.

Upon receiving the commitment, the verifier picks
a fresh (large) nonce $N_v$ and prepares to initiate the
distance bounding phase in which it will measure
the distance bound to the prover. The verifier starts
a high precision clock to measure the (roundtrip)
time of flight of the signal and begins to transmit
his nonce $N_v$ on channel $C_0$. From this point on, the
verifier will also listen on the two reply channels $C_1$
and $C_2$ and will keep listening on the two channels
until he either receives the expected response from
the prover or until he detects an error and aborts
the protocol.

Figure 4: RF distance bounding protocol. The steps shown in the figure are:
  P (Prover): pick $N_p$
  P -> V: sign(commit($N_p$))
  V (Verifier): pick $N_v$
  Distance bounding phase: r <- CRCS($N_v$, $N_p$); V records $\Delta t$
  V: $N_p$ <- channel(r)
  P -> V: sign(V, $N_p$, $N_v$)
  V: verify $\Delta t$, $N_v$, $N_p$, sign(V, $N_p$, $N_v$)





As soon as the prover receives (and demodulates)
the first bit of $N_v$ on $C_0$, he starts switching reply
channels according to the bits of his nonce $N_p$.
Here, we note that while the first few bits are being
demodulated, the prover is still reflecting the input
(challenge) bits, but he has not started the switching
of the channels (i.e., he has not started sending
back $N_p$). The demodulation of the bits is not done
within the distance bounding hardware (that we call
the distance bounding extension), but is done in the
prover’s regular radio. It is not important how long
it takes for the prover’s radio to demodulate the
first bits, since the prover does not need to begin
to switch the output channels within any predefined
time (as long as the switching starts within the duration
of $N_v$ and allows the transmission of $N_p$).
Equally, the first part of $N_v$ could be known and
constitute a public, fixed-length preamble upon the
detection of which the prover would start switching
the channels (i.e., would start sending $N_p$).

When the prover starts sending $N_p$, he will send
the bits of $N_p$ with a fixed frequency (e.g., every
500ms) by switching channels depending on the
value of the current bit (Figure 2). In each interval,
the prover will therefore reflect back several bits of
$N_v$ and a single bit of $N_p$. The bit of $N_p$ is encoded
in the choice of the reply channel. The prover will,
in parallel, also receive the challenge on channel $C_0$
using his regular radio and will demodulate it.
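
The interval-based switching can be sketched as follows. The sketch assumes that the challenge length is an exact multiple of the number of challenge bits per interval and ignores the initial period before switching starts; the function names and channel labels are hypothetical.

```python
def reflect_stream(nv_bits, np_bits, bits_per_interval):
    """Prover: reflect each challenge bit on the channel selected by the
    current bit of the prover's nonce; the channel changes once per interval."""
    reflected = []
    for index, challenge_bit in enumerate(nv_bits):
        prover_bit = np_bits[index // bits_per_interval]
        channel = "C2" if prover_bit else "C1"
        reflected.append((channel, challenge_bit))
    return reflected


def recover_nonces(reflected, bits_per_interval):
    """Verifier: recover the challenge from the reflected bits and the
    prover's nonce from the channel used in each interval."""
    nv = [bit for _, bit in reflected]
    np_bits = [1 if reflected[i][0] == "C2" else 0
               for i in range(0, len(reflected), bits_per_interval)]
    return nv, np_bits


nv = [1, 0, 1, 1, 0, 0, 1, 0]      # verifier's challenge bits
np_ = [0, 1]                       # prover's nonce bits, one per interval
stream = reflect_stream(nv, np_, bits_per_interval=4)
```

Each interval carries four reflected challenge bits and one prover bit in this example, matching the description that several challenge bits are reflected per bit of the prover's nonce.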


When the verifier has sent all the bits of his nonce,
he waits for the prover to complete the reflection of
the signal and then both the prover and verifier disable
their distance bounding extensions. The verifier
can then use an auto-correlation detector like
the ones used in GPS receivers [20] to determine the
exact time of flight of the reflected signal. This can
also be done during the distance bounding phase,
i.e., in parallel to the analog distance bounding circuit.

After the (time-critical) distance bounding phase
is complete the prover sends a signed message containing
his nonce $N_p$, the identity of the verifier V
and the verifier’s nonce $N_v$, to the verifier. The verifier
must then check five things:


• That all the bits of $N_p$ reflected by the prover
are of the same width (time duration). This
is necessary to prevent mafia fraud and is described
in more detail in Section 5.3.

• The data that was reflected back from the
prover must be exactly the same as what was
sent. I.e., when the signal $r(t) = r_1(t) + r_2(t)$
is demodulated, the message must contain $N_v$.
This is visualized in Figure 2.

• The value of $N_p$ obtained during the distance
bounding phase must match the commitment
sent in the first protocol message.

• The signature of the final message must be valid
and it must correspond to the expected identity
of the prover.

• The time of flight of the signal $\Delta t$ must be less
than some predefined upper limit $t_{max}$. The
upper limit is application dependent. E.g., it
can correspond to the radius of some region of interest, or it
can correspond to the (estimated) maximum transmission
range of the radio.
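
The five checks can be summarized in a short sketch. Everything specific here is an assumption for illustration: the commitment is modelled as a SHA-256 hash of the prover's nonce, signature verification is assumed to happen elsewhere and is passed in as a boolean, and the width tolerance and all field names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Transcript:
    commitment: bytes          # hash received in the first protocol message
    sent_nv: bytes             # nonce the verifier transmitted
    reflected_nv: bytes        # data demodulated from the reflected signal
    recovered_np: bytes        # prover nonce read from the channel selection
    bit_widths_ns: list        # measured duration of each prover-nonce bit
    signature_valid: bool      # result of verifying the final signed message
    delta_t_ns: float          # measured time of flight


def accept(t: Transcript, t_max_ns: float, width_tolerance_ns: float = 1.0) -> bool:
    """Return True only if all five checks pass (their order does not matter)."""
    checks = [
        max(t.bit_widths_ns) - min(t.bit_widths_ns) <= width_tolerance_ns,
        t.reflected_nv == t.sent_nv,
        hashlib.sha256(t.recovered_np).digest() == t.commitment,
        t.signature_valid,
        t.delta_t_ns < t_max_ns,
    ]
    return all(checks)


nonce_p = b"\x5a\xc3"
good = Transcript(
    commitment=hashlib.sha256(nonce_p).digest(),
    sent_nv=b"\x0f\xf0", reflected_nv=b"\x0f\xf0",
    recovered_np=nonce_p,
    bit_widths_ns=[500.0, 500.2, 499.9],
    signature_valid=True,
    delta_t_ns=40.0,
)
accepted = accept(good, t_max_ns=100.0)
```

A single failing check is enough to reject the measurement, regardless of how plausible the measured time of flight looks.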


USENIX Association

The order in which these checks are performed is
not important but all checks must pass for the distance
bound to be accepted. If all the checks pass,
the verifier calculates the distance to the prover as

$$d = \frac{\Delta t - \delta_p}{2} \cdot c \qquad (1)$$

where c is the speed of light and $\delta_p$ is the very small
processing delay of the prover. In our implementation
$\delta_p < 1$ns, resulting in a maximum error of about
15cm.
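
The 15cm figure follows directly from equation (1): a prover able to process instantaneously gains at most the distance that corresponds to the processing delay attributed to it (written $\delta_p$ here). Taking the speed of light as approximately $3 \times 10^{8}$ m/s, which is a rounding used only for this illustration:

```latex
\frac{c \cdot \delta_p}{2}
  < \frac{(3 \times 10^{8}\,\mathrm{m/s}) \cdot (10^{-9}\,\mathrm{s})}{2}
  = 0.15\,\mathrm{m}
```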


5 Security Analysis 


In this section we analyze the resistance of our pro- 
tocol to distance fraud and mafia fraud, as well as 
attacks against CRCS. 


5.1 System And Attacker Model 


We consider three nodes, the prover P, the verifier 
V and the attacker M. The goals for the three par- 
ticipants are as follows: the verifier wants to acquire 
an upper bound on the distance to the prover, i.e., 
the verifier wants to know that the prover is closer 
than a certain distance. The prover wants to prove 
to the verifier that he is within a certain distance. 
The goal of the attacker is to disrupt this process 
such that the verifier obtains an incorrect distance 
bound. The verifier holds an authentic public key 
of the prover. The attacker and the prover do not 
collude. The attacker corresponds to the standard
Dolev-Yao attacker that controls the network and
thus can eavesdrop on all the communication between
the prover and the verifier, and can arbitrarily
insert and remove messages to/from the communication
channel. She is equally free to transmit nonsensical
signals. The attacker knows the public parameters
of the distance bounding protocol and the type
of hardware used by the nodes and thus the processing
times of the prover’s and verifier’s radios. She is
only limited by the fact that she does not have access
to the secrets that are held by the prover and the
verifier and cannot break cryptographic primitives.

We consider two attacks: Distance fraud, where 
the prover tries to shorten the measured distance 
bound, and Mafia fraud where the attacker tries to 
shorten the bound (but does not collude with the 
prover). We show that our protocol resists both 
attacks. There is a third type of attack in which 
the attacker colludes with the prover and has access 
to some, but not all, of the secret key material of 
the prover (e.g., only nonces and short-term secrets). 
This attack is often called the terrorist attack. We
do not specifically address terrorist attacks, but it
has been shown [4] that if needed, distance bounding
protocols can be extended to generally protect
against this attack.


5.2 Distance Fraud 


Distance fraud is an attack performed by a malicious 
prover and consists of the prover trying to shorten 
the distance measured by the verifier. 

The verifier uses equation (1) to calculate the distance
to the prover. For the prover to “shorten”
the distance to the verifier (without actually moving
closer) he must manipulate the verifier’s calculation
and the only thing the prover can influence is
$\Delta t$. For the prover to reduce the $\Delta t$ measured by
the verifier, thereby reducing the distance, he must
make his replies arrive at the verifier sooner than
they otherwise would, i.e., he must guess the correct
reply (i.e., guess the challenge) and send it before
the verifier expects. In our protocol, the reply which
the prover must send back is the signal he receives
on channel $C_0$. In order to do this, the prover must
guess the content of the challenge signal since the
content of the reply is checked by the verifier as a
part of the verification process. The content of the
challenge is $N_v$ and the probability of successfully
guessing that is given by $\frac{1}{2^{|N_v|}}$.

Attacks that rely on manipulation of the modulation
scheme, e.g., “late commit” attacks described by
Hancke and Kuhn [14], will not work on this protocol
because the verifier uses auto-correlation to find the
exact time-of-flight of the signal (as it is done in GPS
receivers [20]) rather than using a peak or energy detector.
This means that any manipulation done to,
say, the first symbol of the response will not have any
effect unless all subsequent symbols are also shifted
forward. This would require the malicious prover to
guess all the symbols in advance and can therefore
only be done with negligible probability of $\frac{1}{2^{|N_v|}}$.

The same argument applies to attacks where the
prover tries to guess the first bit of the nonce [8].
Because the prover doesn’t store and forward the
nonce, but instead must reflect it directly, the prover
would have to guess all the bits of the verifier’s nonce
to perform the attack. We can therefore conclude
that the prover can commit distance fraud only with
probability $\frac{1}{2^{|N_v|}}$.
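
To give a sense of scale for this probability, the sketch below evaluates it for a given nonce length, assuming the verifier's nonce is chosen uniformly at random. The function name and the 64-bit example length are illustrative assumptions; the text only requires the nonce to be large.

```python
from fractions import Fraction


def guess_probability(nonce_bits: int) -> Fraction:
    """Probability of guessing every bit of an n-bit uniformly random nonce."""
    return Fraction(1, 2 ** nonce_bits)


# Example: a 64-bit verifier nonce (the length is an assumption for illustration).
p_64 = guess_probability(64)
```

Each additional nonce bit halves the success probability of distance fraud, so even moderately long nonces make the attack impractical.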


5.3 Mafia Fraud


Mafia fraud is an attack performed by an external
attacker that physically resides closer to the verifier
than the prover. The attack aims to make one of
the parties (either the prover or the verifier or both)
believe that the protocol was successfully executed
when, in fact, the attacker shortened the distance
measurement. The requirement that the attacker be
closer to the verifier than the prover is only necessary
because, if the attacker is further away, the attack is
trivially defeated by the protection against distance
fraud attacks.

In order for an external attacker to shorten the distance
measured by the verifier, the attacker must respond
before the prover during the distance bounding
phase. However, because of the checks performed
by the verifier at the end of (or during) the distance
bounding phase, it is not sufficient to just reply before
the prover; the attacker must also make the
value of his nonce match the commitment sent by
the prover at the beginning of the protocol. Since
the attacker cannot find a nonce that matches the
commitment sent by the prover (that would require,
e.g., finding a collision for the hash function used to
generate the commitment), the attacker is forced to
replace the prover’s commitment with his own,
thereby passing the commitment check. However,
the attacker cannot fake the prover’s signature in
the final message, so he cannot confirm the nonce.
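
The two checks discussed above can be illustrated with a minimal Python sketch. It assumes a SHA-256 hash commitment and models the prover's signature with an HMAC under a key the attacker does not hold; both choices, and all function names, are illustrative assumptions rather than the primitives used by the protocol.

```python
import hashlib
import hmac
import secrets

def commit(nonce: bytes) -> bytes:
    # Hash commitment to a nonce (SHA-256 is an assumed choice).
    return hashlib.sha256(nonce).digest()

def sign(key: bytes, message: bytes) -> bytes:
    # Stand-in for the prover's signature: an HMAC under the signer's key.
    return hmac.new(key, message, hashlib.sha256).digest()

def verifier_accepts(commitment, observed_nonce, signature, prover_key):
    # Check 1: the nonce observed during distance bounding opens the commitment.
    commitment_ok = hmac.compare_digest(commit(observed_nonce), commitment)
    # Check 2: the final message is authenticated by the genuine prover.
    signature_ok = hmac.compare_digest(sign(prover_key, observed_nonce), signature)
    return commitment_ok and signature_ok

prover_key = secrets.token_bytes(32)
prover_nonce = secrets.token_bytes(16)

# Honest run: commitment, nonce and signature all come from the prover.
honest = verifier_accepts(commit(prover_nonce), prover_nonce,
                          sign(prover_key, prover_nonce), prover_key)

# Attack: the attacker substitutes his own commitment and nonce, which passes
# check 1, but he can only sign with his own key, so check 2 fails.
attacker_key = secrets.token_bytes(32)
attacker_nonce = secrets.token_bytes(16)
forged = verifier_accepts(commit(attacker_nonce), attacker_nonce,
                          sign(attacker_key, attacker_nonce), prover_key)
```

In this toy model the substituted commitment is internally consistent, so only the authentication of the final message exposes the attack.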

The attacker can get the prover to reply before
the prover receives $N_v$, e.g., by sending his own early
signal to the prover. However, this will result in the
prover getting $N_v' \neq N_v$, which will be detected by
the verifier in the final message. This assumes that
any malicious change to the signal will result in a
change in the demodulated nonce $N_v$. If that cannot
be guaranteed, e.g., because of the sample rate at
the prover or the modulation scheme used for
communication, the prover can record the raw incoming
signal and send it back to the verifier. The verifier
can then, e.g., use autocorrelation to make sure the
signal received by the prover is the same as what the
verifier sent.

We can therefore conclude that an attacker can 
only commit mafia fraud if he can break, either the 
commitment scheme or the signature scheme used in 
the protocol. 

Because of the way the distance bounding radio
extension is designed, it is possible for an attacker
to get the current bit of the prover’s nonce. As
explained in Section 3.1, the prover’s radio extension
will shift any signal that arrives on the center
channel to either channel $C_1$ or channel $C_2$ depending on
the current bit of the prover’s nonce. An attacker
can exploit this to get the current bit of the prover’s
nonce without the prover’s knowledge. If the
attacker sends a very weak signal, e.g., a DSSS [21]
signal with a spreading code known only to the
attacker, the attacker can determine what channel the
response is sent back on, and therefore the current
bit of the prover’s nonce. Unless this is prevented,
the attacker can use this information to perform a
successful mafia fraud attack.

Figure 5: Man in the middle attack (Mafia fraud).
The figure shows the timing of the messages sent by
the verifier (V), the attacker (M) and the prover (P).
Even if the attacker is able to learn the value of the
first bit of the prover’s nonce, the attack will fail
because the attacker is forced to make the first bit
longer than the subsequent bits if he wants to reply
early.

In order to prevent this attack the prover must
make sure not to expose all the bits of his nonce
before they are needed. There are two ways this
can be ensured: either the prover enables his
distance bounding hardware only once he is sure
that the verifier has started his transmission, or he
makes sure that his reply bits (of $N_p$) are of
exactly the same duration. Of course the bit duration
must also be known and later checked by the
verifier. Our protocol uses the second method. Figure 5
illustrates how this measure prevents the attack. In
the example of this figure the attacker obtains the
value of the first bit of the prover’s nonce, and uses
it to reply early to the verifier’s challenge. However,
because the prover doesn’t expose the second bit of
his nonce until after the duration of the first bit has
expired, the attacker is forced to make the first bit
‘too long’, thus getting detected.
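
The duration check performed by the verifier can be sketched as follows. This is a minimal illustration only: the function name, the nanosecond units, the tolerance and the sample durations are all hypothetical and do not come from the measurements in this paper.

```python
def bits_have_expected_duration(durations_ns, expected_ns, tolerance_ns):
    """Return True only if every reply bit lasted expected_ns within tolerance_ns."""
    return all(abs(d - expected_ns) <= tolerance_ns for d in durations_ns)

# Hypothetical bit durations as observed by the verifier.
honest_reply = [1000.0, 1000.2, 999.9, 1000.1]

# An attacker who replies early must hold the first bit until the prover
# exposes the second one, so the first bit comes out 'too long'.
stretched_reply = [1350.0, 1000.1, 999.8, 1000.0]

honest_ok = bits_have_expected_duration(honest_reply, 1000.0, 1.0)
attack_ok = bits_have_expected_duration(stretched_reply, 1000.0, 1.0)
```

A single out-of-tolerance bit is enough for the verifier to reject the run.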

In order to perform this attack, the attacker would
need to guess all the bits of $N_p$, which he can do
only with probability $1/2^{|N_p|}$.
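
To give a feel for how quickly this guessing probability vanishes, the short sketch below evaluates it exactly, assuming each nonce bit is uniform and independent so that an n-bit nonce is guessed with probability 2^-n. The function name and the example nonce lengths are illustrative assumptions.

```python
from fractions import Fraction

def guess_all_bits_probability(nonce_bits: int) -> Fraction:
    """Exact probability of guessing every bit of a uniform nonce."""
    return Fraction(1, 2 ** nonce_bits)

p_8 = guess_all_bits_probability(8)      # toy nonce
p_128 = guess_all_bits_probability(128)  # a more realistic length
```

Using `Fraction` avoids floating-point underflow for long nonces; `float(p_128)` is on the order of 1e-39.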


6 Implementation and Measurements


In this section, we describe our implementation of
the prover and the related measurement results.
Our prototype can be seen in Figure 6. The central
part of the prototype is the mixer (1), which
is responsible for shifting the received challenge up
and down in frequency. The signal from the receiving
antenna comes in from the right (A) and passes
through four amplifiers (4) to bring it up to a power
level where it can be mixed by our mixer. The local
500MHz sine wave used for the mixing comes
in from the bottom of the figure (B) and is passed
through a 1dB attenuator (5) to bring it to the same
level as the radio signal before mixing. The output of
the mixer is split in two and each part is passed through
either a high-pass filter (2) or a low-pass filter (3) to
eliminate the unwanted channel. In this prototype
we did not implement the switching mechanism.
Instead, channel $C_2$ is fed directly to the transmission
antenna (C). In order for the signal to split properly,
both sides must have a similar load. For this reason
we added a 50Ω resistor (6) to terminate the unused
channel $C_1$. The switching mechanism can be
implemented using a simple transistor-based
switch. We note that the switch can only
marginally increase the processing delay since, once
set to a particular channel, the switch essentially
acts as a piece of very short wire connecting the
setup to the antenna. This prototype is an
implementation of the scheme described in Figure 3.

Figure 6: This picture shows the prototype
implementation of the prover. It consists of a mixer (1), a
high-pass filter (2), a low-pass filter (3), four
amplifiers (4) (only two visible), a 1dB attenuator (5) and
a terminating resistor (6). The signal from the
receiving antenna (A) is mixed with the local oscillator
(B) and sent to the transmitting antenna (C). The
yellow wires are power (+5V). This prototype is an
implementation of the scheme described in Figure 3.
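
The frequency shifting performed by the mixer can be illustrated with a small numerical sketch. It assumes an ideal multiplying mixer, for which the product of two cosines contains only the difference and sum frequencies; the function names are hypothetical, and the 3.5GHz input is the challenge frequency used in the measurements of Section 6.1.

```python
import math

def mixer_products(f_in_hz: float, f_lo_hz: float):
    """Frequencies at the output of an ideal mixer: (difference, sum)."""
    return (f_in_hz - f_lo_hz, f_in_hz + f_lo_hz)

def identity_residual(f_in_hz: float, f_lo_hz: float, t: float) -> float:
    """|cos(a)cos(b) - (cos(a-b) + cos(a+b))/2| evaluated at time t."""
    a = 2 * math.pi * f_in_hz * t
    b = 2 * math.pi * f_lo_hz * t
    mixed = math.cos(a) * math.cos(b)
    low, high = mixer_products(f_in_hz, f_lo_hz)
    expected = 0.5 * (math.cos(2 * math.pi * low * t)
                      + math.cos(2 * math.pi * high * t))
    return abs(mixed - expected)

low_channel, high_channel = mixer_products(3.5e9, 0.5e9)
residual = identity_residual(3.5e9, 0.5e9, 1.23e-10)
```

The high-pass and low-pass filters then each keep one of the two products, which is why the unused branch still needs a matched load.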


6.1 Delay At The Prover 


We first wanted to see if our prototype implementation
could receive a signal, shift it to another channel
and transmit it back to the verifier in < 1ns.

In order to test this, we first transmit the challenge
and response signals through cables so as to
better be able to control signal strength and reduce
noise (later we show that the same setup works using
wireless communication as well). The challenge signal
sent on channel $C_0$ is a 3.5GHz sine, modulated
by a 1Hz pulse so it is easy to see and capture the
start of a new “bit”. Our response signal is sent back
on channel $C_2$ at 4.0GHz (i.e., $f_0$ = 3.5GHz and
$f_\Delta$ = 0.5GHz). We generated the 3.5GHz challenge
using a function generator. The generated signal is
split by a power splitter and one end is fed, via a
1 meter cable, into our prototype. The other end
was connected to a 40GS/s oscilloscope, via another
1 meter cable, to provide the ground truth signal to
which we compare the delay of our prototype. Because
both cables have the same length, the 3.5GHz
signal (the challenge) will arrive at the same time
at the oscilloscope and at the reception point of our
prototype. The output (the response) from the
prototype is plugged directly into another input of the
same oscilloscope (keeping the signal path as short
as we could make it using this setup).

Figure 8: Processing time at the prover. The ten
different delay measurements were done using our
measurement setup described in Section 6.1. The
figure shows that the variation in processing time is
small ($\sigma$ = 61.22ps) and that the average processing
delay is $\mu_p$ = 912.92ps, i.e., less than 1ns.


Figure 7(a) shows the two signals. The top (yel- 
low) signal is coming directly from the function gen- 
erator. It is an exact copy of the signal that arrives 
at the input of our prototype (this signal arrives 
at the oscilloscope and at the prototype input at 
the same time). The bottom (green) signal is what 
comes out of our prototype implementation. It is a 
4.0GHz signal, i.e., the original signal shifted up by 
500MHz. We see that the difference in arrival times 
between these two signals (i.e., the processing time of 
the prover) is 0.888ns. As described in Section 2 the 
delay at the prover determines the theoretical advan- 
tage a powerful attacker might get. If we translate 
0.888ns into distance, the maximum theoretical dis- 
tance by which an attacker will be able to shorten 
its distance is about 12cm. 
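
The translation from processing delay to distance can be sketched as a short calculation. The model below is an assumption on our part: the signal travels at the vacuum speed of light and the delay is split over the round trip, so the one-way advantage is c·t/2. Under that model a delay of 0.888ns corresponds to roughly 13cm, in the same range as the figure of about 12cm quoted above.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # vacuum speed of light

def max_distance_shortening_m(processing_delay_s: float) -> float:
    """Upper bound on the distance an attacker can gain from the prover's delay.

    The delay is part of the measured round-trip time, so it maps to
    half the distance light travels in that time.
    """
    return SPEED_OF_LIGHT_M_PER_S * processing_delay_s / 2

advantage_m = max_distance_shortening_m(0.888e-9)
```

The same function shows why processing delays in the microsecond range are unusable: 1µs would correspond to about 150m.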


We repeated this measurement 10 times, using the
same setup. Figure 8 shows all 10 measured processing
times along with their average value and a 95%
confidence interval. We see from the figure that the
processing time of the prover is stable between 0.8ns
and 1ns.

Figure 7: The delay of the prover’s distance bounding radio extension. The top signal is measured at the
reception antenna of the prover’s radio and is transmitted on channel $C_0$ at 3.5GHz. The bottom signal is
measured at the transmission antenna and is being transmitted on channel $C_2$ at 4.0GHz. The delay
between them, and thus the prover’s processing time, is 0.888ns.

Note that if the same setup were implemented
in an integrated circuit, the signal path
would be a lot shorter and consequently the processing
time would be smaller. We therefore do
not claim that our prototype is the best that can be
achieved; rather, it shows the processing time that
can be achieved using standard SMA components.


6.2 Wireless Implementation 


Since distance bounding protocols are primarily use- 
ful in wireless environments, in this section we show 
that our prototype equally enables distance bound- 
ing using wireless communication (instead of wires). 
The basic construction of the prover is the same as 
in the wired setup, except that the prototype input 
and output are connected to antennas. The function
generator that generates the verifier’s signal and the
oscilloscope used to measure the round trip time are
likewise connected to antennas.

The result of the wireless implementation can be
seen in Figure 7(b). Unfortunately, we had to use
SMA cables of about 1m to connect the antennas
because of the way the antennas are mounted. In
addition, there was about 0.1m between the transmission
antenna and the receiving antenna. This results
in a delay introduced by the cables and the space
between the antennas, referred to in Figure 7(b) as
“antenna cable delay”. The output of the prototype
was passed through a high-pass filter and the input
passed through a low-pass filter to prevent the
transmitting antenna from feeding back into the
receiving antenna. The oscilloscope used to measure
the difference in arrival time also had filters to
separate the ground truth signal, i.e., the signal coming
directly from the function generator, from the one
being transmitted by the prototype. The filters
allowed a full duplex wireless channel to be created
between our wireless prototype and the function
generator and oscilloscope.

It should be noted that the channel switching 
mechanism of our prototype is ideal for a wireless im- 
plementation. Any wireless distance bounding pro- 
tocol needs more than one channel (i.e., full duplex) 
in order to reply as fast as possible. Encoding the 
prover’s reply in the choice of channel means that 
the solution is straightforward to apply without
causing interference between the prover and verifier.





7 Related Work 


Distance bounding, as a concept, was first proposed
by Brands and Chaum in [3], who introduced
techniques enabling a verifier to determine an upper
bound on the physical distance to a prover (as
summarized in Section 2). They also considered
the case where the verifier authenticates
the prover in addition to establishing the distance
bound.

Several optimizations and studies of distance
bounding were subsequently proposed for wireless 
networks, including [28, 30, 5] and for sensor net- 
works [18, 5, 27]. Distance bounding protocols 
have also been proposed in other contexts, e.g., for 
RFIDs [13, 10, 19] and ultra wide band (UWB) de- 
vices [17, 12]. 

In [23] the authors studied information leakage
in distance bounding protocols. A mutual distance
bounding protocol using interleaved challenges and
responses was proposed in [31], and in [28] and [5]
the authors investigated the use of distance bounding
protocols for location verification and secure
localization. Sastry, Shankar and Wagner [25]
proposed the so-called “in-region verification”, appropriate
for certain applications, such as location-based
access control. Collusion attacks on distance bounding
location verification protocols were considered
in [7, 6]. Ultrasonic distance bounding was used for
access control [25] and for key establishment [32].
In [22] ultrasonic distance bounding was further used
for proximity-based access control to implantable
medical devices. Other attacks have been
proposed against distance bounding protocols in
general. The so-called “late-commit” attacks were
proposed in [14], where the attacker exploits the
modulation scheme in order to manipulate the distance.
Bit guessing attacks [8] that accomplish the same
thing were also proposed. These attacks were
further studied in practical implementations in [11].

Until now, most of the work done in this field has 
been theoretical. To our knowledge our work is the 
first to propose a realizable distance bounding pro- 
tocol using radio communication, with a processing 
time at the prover that is low enough to provide a 
useful distance granularity. 


8 Conclusion 


We demonstrated that radio distance bounding
protocols can be implemented to meet the strict
processing-time requirements of these protocols (i.e., that the
prover receives, processes and transmits signals in
< 1ns). This can be achieved using a specially
implemented concatenation as the prover’s processing
function. Through this we showed that the use of
processing functions which require that the prover
demodulates (interprets) the verifier’s challenge
before responding to it is neither desirable nor necessary for
distance bounding. Finally, we showed that other
processing functions such as XOR and the comparison
function, which were used in a number of proposed
distance bounding protocols, are not well suited to
the implementation of radio distance bounding.


The case for ubiquitous transport-level encryption


Andrea Bittau Michael Hamburg Mark Handley David Mazières Dan Boneh
Stanford Stanford UCL Stanford Stanford

Abstract


Today, Internet traffic is encrypted only when deemed
necessary. Yet modern CPUs could feasibly encrypt most
traffic. Moreover, the cost of doing so will only drop
over time. Tcpcrypt is a TCP extension designed to make
end-to-end encryption of TCP traffic the default, not the
exception. To facilitate adoption, tcpcrypt provides
backwards compatibility with legacy TCP stacks and
middleboxes. Because it is implemented in the transport layer,
it protects legacy applications. However, it also provides
a hook for integration with application-layer
authentication, largely obviating the need for applications to
encrypt their own network traffic and minimizing the need
for duplication of functionality. Finally, tcpcrypt
minimizes the cost of key negotiation on servers; a server
using tcpcrypt can accept connections at 36 times the rate
achieved using SSL.


1 Introduction 


Why is the vast majority of traffic on the Internet not en- 
crypted end-to-end? The potential benefits to end-users 
are obvious—improved privacy, reduced risk of sensitive 
information leaking, and greatly reduced ability by op- 
pressive regimes or rogue ISPs to monitor all traffic with- 
out being detected. In spite of this, end-to-end encryption 
is generally used only when deemed necessary, a small 
fraction of when it would be feasible. 
Possible reasons for not encrypting traffic¹ include:


• Users don’t care.


• Configuration is complicated and the payoff small
(especially when connecting to unknown sites).


• Application writers have no motivation.


• Encryption (and key bootstrap) are too expensive to
perform for all but critical traffic.


• The standard protocol solutions are a poor match for
the problem.


¹ Conspiracy theorists might suggest other reasons, but we won’t
discuss those here.


We believe that each of these points either is not true, 
or can be directly addressed with well-established tech- 
niques. For instance, where users actually have con- 
trol, they demonstrate that they do care about encryp- 
tion. Four years ago only around half of WiFi basesta- 
tions used any form of encryption [3]. Today it is rare to 
find an open basestation, other than ones which charge 
for Internet access. 

It is clear, though, that application writers have lit- 
tle motivation: encryption rarely makes a difference to 
whether an application succeeds. Getting it right is diffi- 
cult and time consuming, doesn’t help time to market, 
and developers are hard-pressed to make the business 
case. For server operators, too, the process can be te- 
dious. One reason people don’t use SSL is that X.509 
certificates are a mild pain both for the server administra- 
tor and, if the server administrator didn’t buy a certificate 
from a well-known root CA, for users. 

Even more important is the performance question. 
SSL is by far the most commonly deployed crypto- 
graphic solution, and it is expensive to deploy on servers. 
Where there is a need, such as for bank login or credit 
card payments, SSL is ubiquitous, but it is rarely used 
outside of web pages that are especially sensitive. The 
definition of “sensitive” has started to change, though; 
Google recently enabled SSL on all Gmail connec- 
tions [25], ostensibly as a response to eavesdropping 
in China. In part this is possible today because cryp- 
tographic hardware has become comparatively inexpen- 
sive. This trend is set to continue; the most recent
generation of Intel CPUs incorporates AES acceleration
instructions [8], with the potential to significantly reduce
the cost of software symmetric-key encryption.

Although symmetric-key encryption is unlikely to be
a problem, the conventional wisdom is still that it is too
expensive to use public-key cryptography to bootstrap 
a session key for all network connections. Indeed our 
measurements show that a fully loaded eight-core (2 x 
Quad-core Xeon X5355) server can only establish 754 
uncached SSL connections per second. In fact, this lim- 
itation is due to the way SSL uses public key algorithms 
rather than anything fundamental. We will show that 
much better server performance is possible with the right 
protocol design, in part by pushing costs to the client, 
which does not need to handle high connection rates. 
Finally, there is the question of whether current en- 
cryption protocols are a sufficiently good match for ap- 
plications that do not currently use encryption. We be- 
lieve they are not, for reasons we shall highlight through- 
out the paper. However, we will describe a subtly differ- 
ent protocol architecture that we believe is a much better 
fit to the majority of applications. This is not rocket sci- 
ence; it may even be considered obvious. But we believe 
it makes a huge difference to the deployability of encryp- 
tion and consequently of authentication in the real world. 


1.1 Getting the Architecture Right 


All the commonly deployed network encryption mecha- 
nisms incorporate authentication into the protocol, even 
if, like WPA, it is as simple as requiring out-of-band 
password exchange. Indeed this is the obvious way to 
engineer things; without authentication, it is not possible 
to determine if your encrypted channel is with the desired 
party or with a man-in-the-middle. However, we believe 
that this is fundamentally the wrong design choice. 

Encryption of a network connection is a general pur- 
pose primitive; regardless of the application, the goal 
is to prevent eavesdroppers from learning the contents 
of communications. MACing of packets in a network 
connection is also a general purpose primitive; no ap- 
plication wants to accept forged or maliciously modi- 
fied packets. Authentication, however, is not general 
purpose. The mechanism used for authentication and 
the information needed to perform that authentication 
are application-specific. In practice, protocols blur this 
distinction between general purpose encryption/integrity 
and special purpose authentication. This has two conse- 
quences: 


• It tends to encourage inappropriate authentication
mechanisms. For example, using SSL to connect to
a bank, then simply handing the user’s password to
the bank, when it is known that people commonly
re-use passwords across sites.


• It makes it hard to integrate mechanisms low
enough in the protocol stack to really be ubiquitous.
For example, adding SSL to an application
requires modifying the source code and, potentially,
extending its application-layer protocol in a
backwards compatible way.


To enable encryption and integrity checking in a general
way for all legacy TCP applications², this functionality
must be below the application layer. However it
cannot be done cleanly any lower than the transport layer 
because this is the lowest place in the stack that has any 
concept of a conversation. There is also the practical 
consideration that encrypting below the transport layer 
will prevent NAT traversal. The clear implication is that 
embedding encryption and integrity protection into TCP 
would provide the right general-purpose mechanism; in 
fact, because TCP includes a session establishment hand- 
shake, this is simple to do in a backward-compatible way. 

To establish session keys in a general way, TCP-level 
encryption should be divorced from higher level authen- 
tication mechanisms. This suggests the use of ephemeral 
public keys to establish session keys. Such a mechanism, 
enabled by default, would provide protection against pas- 
sive eavesdroppers for all TCP sessions, even for legacy 
applications. We are not the first to suggest such
“opportunistic” encryption. Our goal, though, is to provide
not just encryption and integrity protection, but also a 
firm foundation upon which higher-level authentication 
mechanisms can build. With the right architecture, a di- 
verse set of authentication mechanisms can be devised, 
each suitable to its own application. 

The end point we hope to establish is that all TCP ses- 
sions (and SCTP and DCCP, though we don’t discuss 
these further here) are protected against passive eaves- 
droppers, and that all applications that require authenti- 
cation should, as a side effect, enjoy protection against 
active man-in-the-middle attacks, all without duplica- 
tion of effort. Ideally, an eavesdropper cannot tell from 
watching the traffic which encrypted sessions will be au- 
thenticated. 

In this paper, we describe tcpcrypt, our implementation
of TCP-level encryption. Although the idea is
simple, the details really matter, as we will show. We
have validated our design by building two implementations,
one a Linux kernel module, the other a user-space
process using divert sockets. The latter allows
use of tcpcrypt on Linux, FreeBSD, and MacOS X without
modifying the kernel. Both implementations show
excellent performance; we will demonstrate that this is
no longer the factor preventing ubiquitous network
encryption. We have also implemented application-level
authentication protocols that use tcpcrypt to bootstrap
authentication. These include X.509 certificate-based
authentication, fast password-based mutual authentication,
and PAKE. Our X.509-based authentication provides
security equivalent to SSL, but uses batch-signing
to run 25 times faster. Moreover, we have implemented
X.509 authentication inside the OpenSSL library in a
way that preserves the same API and cleanly falls back to
vanilla SSL when appropriate. Thus, taking advantage
of tcpcrypt in SSL-enabled applications requires only a
library update.

² The vast majority of Internet applications use TCP.


2 Cryptographic design 


The goal of tcpcrypt is to enable the best communica- 
tions security possible under a wide range of circum- 
stances. In the absence of any authentication, when users 
browse unknown servers, they should enjoy protection 
from passive eavesdropping. Though active network at- 
tackers may still intercept and monitor communications 
(there are also legitimate reasons for this, such as trans- 
parent proxies and intrusion detection systems), it should 
be possible to detect such behavior both during commu- 
nications and afterward. Thus, tcpcrypt should virtu- 
ally eliminate the possibility of widespread eavesdrop- 
ping unbeknownst to a user population. 

When an application performs any kind of endpoint 
authentication, it must be able to leverage tcpcrypt to 
obtain stronger protection of session data. For instance, 
given a server-side X.509 certificate, the client should be 
assured of the confidentiality of the data it transmits and 
the integrity of the data it receives. Any time a user types 
a password, it should be possible to ensure the confiden- 
tiality and integrity of all data sent in either direction. 

In all cases, when tcpcrypt achieves confidentiality, it 
should also provide forward secrecy. As a final goal, 
tcpcrypt should affect performance as little as possible. 
Thus, the protocol is designed to minimize the number of 
cryptographic operations and extra round trips, subject to 
the limitations of needing to interoperate with legacy end 
hosts and middleboxes. 


2.1 Key exchange protocol 


Key exchange is the biggest challenge to tcpcrypt’s per- 
formance. Forward secrecy requires a pair of hosts to ex- 
change a secret using an ephemeral public key or Diffie- 
Hellman key exchange the first time they communicate. 
These operations are far more costly than establishing a 
TCP connection, but the cost can be asymmetric. For 
example, a single core of the server in Section 6 can per- 
form 12,243 encryptions/sec with a 2,048-bit RSA-3 key, 
but only 97 decryptions/sec. 

Servers typically communicate with more peers than 
clients do, so it makes sense for clients to shoulder most 
of the cost of key exchange. Thus, by default, tcpcrypt 
performs the expensive decryption at the client (though 
for generality, servers may opt to reverse the protocol). 
Figure 1: Tcpcrypt connection establishment with key
exchange (left) and session caching (right). 


Subsequent connections between the same two hosts can 
use session caching to avoid any public key operations 
at all, thereby ensuring that, for instance, an active-mode 
FTP server need not perform RSA decryptions. 

The initial key exchange works as follows. Each
machine $C$ has an ephemeral public key, $K_C$. When $C$
connects to a server $S$ for the first time, $C$ chooses a random
nonce, $N_C$; $S$ chooses a random secret, $N_S$; the two
exchange the following messages, also shown in Figure 1:


$C \to S$: HELLO
$S \to C$: PKCONF, pub-cipher-list, [cookie]
$C \to S$: INIT1, sym-cipher-list, $N_C$, $K_C$, [cookie]
$S \to C$: INIT2, sym-cipher, ENCRYPT($K_C$, $N_S$)


Here pub- and sym-cipher-list are used to negotiate cryp- 
tographic algorithms. The optional cookie is a SYN- 
cookie that must be echoed by the client to make it harder 
for packets from forged source addresses to trigger any 
public-key cryptographic operations in the server. This 
trade-off is at the discretion of the server; if TCP’s 32- 
bit initial sequence number (ISN) provides enough pro- 
tection against forged packets, the option space may be 
deemed better used for other purposes. 

$K_C$ specifies the public key cipher and a pseudo-random
function, used below. Quantities from this protocol
are then combined into a series of “session secrets”
with a Collision-resistant Pseudo-random Function, CPF
(currently HMAC):


$ss[0] \leftarrow \mathrm{CPF}(N_S, \{K_C, N_C, \text{cipher-lists}, \text{sym-cipher}\})$
$ss[i] \leftarrow \mathrm{CPF}(ss[i-1], \text{TAG\_NEXT\_KEY})$


If $\mathrm{ISN}_{C,i}$ and $\mathrm{ISN}_{S,i}$ are TCP’s initial sequence numbers
on the client and server for session $i$, the two sides then
compute a master secret as follows:


$mk[i] \leftarrow \mathrm{CPF}(ss[i], \{\text{TAG\_KEY}, \mathrm{ISN}_{C,i}, \mathrm{ISN}_{S,i}\})$


Finally, the two sides use $\mathrm{CPF}(mk[i], x)$ on various
constants $x$ to generate encryption and MAC keys (a
common technique). From this point on, all further segments
in the TCP connection are cryptographically protected.
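
The derivation chain above can be sketched in Python. This is a minimal illustration under stated assumptions, not the tcpcrypt wire format: CPF is instantiated as HMAC-SHA256 (the text only says HMAC), the tag values are placeholders, and the length-prefixed `encode` helper is a hypothetical way to give the inputs a unique parse.

```python
import hashlib
import hmac

# Placeholder tag values; the real constants are not given in the text.
TAG_NEXT_KEY = b"next-key"
TAG_KEY = b"key"

def cpf(key: bytes, data: bytes) -> bytes:
    """CPF instantiated as HMAC; SHA-256 is an assumed hash choice."""
    return hmac.new(key, data, hashlib.sha256).digest()

def encode(*fields: bytes) -> bytes:
    """Length-prefix every field so the concatenation parses uniquely."""
    return b"".join(len(f).to_bytes(4, "big") + f for f in fields)

def initial_secret(n_s, k_c, n_c, cipher_lists, sym_cipher):
    # ss[0] <- CPF(N_S, {K_C, N_C, cipher-lists, sym-cipher})
    return cpf(n_s, encode(k_c, n_c, cipher_lists, sym_cipher))

def next_secret(ss_prev):
    # ss[i] <- CPF(ss[i-1], TAG_NEXT_KEY)
    return cpf(ss_prev, TAG_NEXT_KEY)

def master_key(ss_i, isn_c: int, isn_s: int):
    # mk[i] <- CPF(ss[i], {TAG_KEY, ISN_C, ISN_S}); ISNs are 32-bit values.
    return cpf(ss_i, encode(TAG_KEY,
                            isn_c.to_bytes(4, "big"),
                            isn_s.to_bytes(4, "big")))

def derive_key(mk, constant: bytes):
    # CPF(mk[i], x) for a per-purpose constant x.
    return cpf(mk, constant)

# Example run with made-up inputs.
ss0 = initial_secret(b"server-secret", b"client-pubkey", b"client-nonce",
                     b"cipher-lists", b"sym-cipher")
ss1 = next_secret(ss0)
mk0 = master_key(ss0, isn_c=1000, isn_s=2000)
enc_key = derive_key(mk0, b"enc")
mac_key = derive_key(mk0, b"mac")
```

Because the master key mixes in both initial sequence numbers, two connections that share a session secret still end up with different traffic keys.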


Note that this full key exchange is only needed for the
first connection between two hosts. Hosts can cache $ss[i]$
for the largest $i$ used till that point. Subsequent
connections between the same two hosts can use this to derive
new symmetric keys, thereby avoiding any further public
key cryptography and the latency of the full handshake.


2.2 Authentication Hooks 


To gain stronger benefits from tcpcrypt, applications 
must be able to make statements about a connection— 
e.g., “All data you read from this connection is sent by 
user U’s browser,” or “Any data you write to this connec- 
tion can be decrypted only by server Y.” To make such 
statements, one must specify what is meant by “this con- 
nection” in a way that cannot be interpreted out of con- 
text. Tcpcrypt accomplishes this through session IDs. A 
new getsockopt call returns a session ID, sid[i], computed 
from the connection’s session secret ss[i] as follows: 


$sid[i] \leftarrow \mathrm{CPF}(ss[i], \text{TAG\_SESSION\_ID})$


If both ends of a tcpcrypt connection see the same 
session ID, then with overwhelming probability an at- 
tacker cannot eavesdrop on or undetectably tamper with 
traffic—i.e., there has not been a man-in-the-middle at- 
tack. Two properties facilitate verification of session IDs. 
First, they need not be kept secret. Second, with over- 
whelming probability they are unique over all time, even 
if one end of a connection is malicious. Hence, a crypto- 
graphically endorsed session ID can only ever authenti- 
cate a single tcpcrypt connection. In Section 4 we discuss 
different ways applications can leverage session IDs. 
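
A session ID comparison of this kind can be sketched as follows. As before, this is only an illustration: CPF is assumed to be HMAC-SHA256, the tag value is a placeholder, and the function names are hypothetical (the actual getsockopt interface is not modelled here).

```python
import hashlib
import hmac

TAG_SESSION_ID = b"session-id"  # placeholder; real constant not given in the text

def session_id(ss_i: bytes) -> bytes:
    # sid[i] <- CPF(ss[i], TAG_SESSION_ID), with CPF assumed to be HMAC-SHA256.
    return hmac.new(ss_i, TAG_SESSION_ID, hashlib.sha256).digest()

def same_session(sid_a: bytes, sid_b: bytes) -> bool:
    # Constant-time comparison of the two endpoints' session IDs.
    return hmac.compare_digest(sid_a, sid_b)

# Both honest endpoints hold the same session secret and so agree on the ID.
client_sid = session_id(b"shared-session-secret")
server_sid = session_id(b"shared-session-secret")

# With a man in the middle, each side shares a different secret with the
# attacker, so the IDs seen at the two ends differ.
mitm_client_sid = session_id(b"secret-with-attacker-1")
mitm_server_sid = session_id(b"secret-with-attacker-2")
```

Since the session ID need not be secret, an application can sign it, display it, or send it over any authenticated channel to perform this comparison.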


2.3 Proof of Security 


To increase confidence in tcpcrypt, we provide a semi- 
formal proof of its security. We assume that the adver- 
sary has complete control over the network, and nearly 
complete control over the users. It can choose when and 
to whom users attempt to connect, and what data they 
send, and can delay, drop, modify, and forge packets
arbitrarily. Furthermore, since the session IDs $sid[i]$ are
not secret, we assume that the adversary knows them.
We do not model malicious machines here, as the ad- 
versary can emulate as many of these as it wants. We 
do not model compromised machines because of space 
constraints. When we write “client” or “server” in this 
discussion, we mean a legitimate client or server. 

We guarantee the security of tcpcrypt connections only 
when the session IDs match. In this case, the guarantee 
is fairly strong: 


Definition 2.1 (Security guarantees). Suppose that users
$U_1$ and $U_2$ complete the tcpcrypt protocol on sockets $S_1$
and $S_2$, and arrive at sessions with the same session ID.
Then the following guarantees hold:


• The adversary has not tampered with $U_1$ and $U_2$’s
cipher suite choices. Assuming they have chosen a
secure cipher suite:


• Any packet sent by $U_1$ on socket $S_1$ (or by $U_2$ on $S_2$)
gives no information to the adversary other than its
length and timing.


• If, after TCP reassembly, $U_2$ receives a sequence of
segments $p_1, \ldots, p_n$, then $U_1$ sent those segments
in that order (and no segments before them), and
similarly for segments received by $U_1$.


We will show that, unless the adversary has broken 
the underlying cryptographic primitives, its probability 
of violating this guarantee is very small. Specifically: 


Theorem 2.1 (Security of tcpcrypt). Suppose that an
adversary A can violate the tcpcrypt security guarantee
with probability ε. Suppose that it uses m machines in its
attack, and begins at most c connections in total. Then
there are five simple modifications of A, running in about
the same time as A, which aim to do the following things:

• Find a collision in CPF.

• Break the pseudorandomness of CPF.

• Break the public-key cipher.

• Break the MAC.

• Break the symmetric cipher.


The sum of their probabilities of success is at least 
eer)? 


where k & 256 is the minimum of the min-entropy of a 
public key, or the length in bits of N_S or N_C.


Proof. Define NEXT(k) := CPF(k, TAG_NEXT_KEY).
Suppose that U_1 and U_2 have the same sid, and that for
U_1 it is sid[i] for some i, where:

ss[0] = CPF(N_S, {K_C, N_C, cipher-lists, sym-cipher})
sid[i] = CPF(NEXT^i(ss[0]), TAG_SESSION_ID)


Because everything passed to CPF has a unique parse, the
sid must have been computed by U_2 in the same way
(in particular with the same values of N_S, N_C, K_C,
the same cipher suite lists and the same cipher choice),
or else the computation contains a hash collision. What is
more, the N_S, N_C, and K_C values are chosen at random,
and so with probability at least 1 - 3c^2/2^{k+1} they are
unique. For the rest of the proof, assume that this is the
case.

Now, each of U_1 and U_2 is either a client or a server.
Because their K_C, N_C and N_S values match, they can’t
both be clients or both be servers; without loss of
generality, say U_1 is the client (which generated K_C and N_C),
and U_2 is the server (which generated N_S).

We will next show that this N_S remains secret. We
first replace ENCRYPT(K_C, N_S) with an encryption of
zero (but the client still decrypts it to N_S). If the
adversary notices this, then it has broken the public-key
cipher. After this change, N_S is only used as a key to
CPF. Furthermore, CPF is evaluated on N_S only once by
U_2 and once by U_1, with a nonce N_C in the other
argument; if the adversary replays ENCRYPT(K_C, N_S), then
CPF(N_S, ·) will be called with different nonces. Because
CPF is pseudorandom, we can replace its outputs ss[0]
with independent random values; if the adversary notices
this, then it has broken CPF. Continuing in this manner,
we can replace ss[i], mk[i], sid[i] and the encryption and
MAC keys with random values, and the adversary will
not notice this, either.

If the initial sequence numbers do not match, the client 
and server will arrive at different (secret, random) MAC 
keys, and so as long as the MAC is unforgeable, nei- 
ther will accept any packets at all. Otherwise since every 
packet is MACed with associated data that includes the 
64-bit extended sequence number, they must be received 
unmodified and in order. Finally, if the symmetric cipher
is secure against chosen-plaintext attacks, the only
information that the adversary can learn about a segment is its
length and timing. This completes the proof. □
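
The key-derivation chain that the proof reasons about can be sketched in a few lines. This is only an illustration under stated assumptions: HMAC-SHA256 stands in for CPF, the tag byte values are hypothetical, and the length-prefixed field encoding is just one simple way to obtain the "unique parse" property the proof relies on. None of these choices are taken from the tcpcrypt specification.

```python
import hashlib
import hmac

# Hypothetical tag values; the text names the tags but not their encodings.
TAG_NEXT_KEY = b"\x01"
TAG_SESSION_ID = b"\x02"


def cpf(key: bytes, data: bytes) -> bytes:
    """Stand-in for CPF (assumed here to be HMAC-SHA256)."""
    return hmac.new(key, data, hashlib.sha256).digest()


def encode_fields(*fields: bytes) -> bytes:
    """Length-prefix each field so the concatenation has a unique parse."""
    return b"".join(len(f).to_bytes(4, "big") + f for f in fields)


def initial_secret(n_s: bytes, k_c: bytes, n_c: bytes,
                   cipher_lists: bytes, sym_cipher: bytes) -> bytes:
    """ss[0] = CPF(N_S, {K_C, N_C, cipher-lists, sym-cipher})."""
    return cpf(n_s, encode_fields(k_c, n_c, cipher_lists, sym_cipher))


def next_key(k: bytes) -> bytes:
    """NEXT(k) := CPF(k, TAG_NEXT_KEY)."""
    return cpf(k, TAG_NEXT_KEY)


def session_id(ss0: bytes, i: int) -> bytes:
    """sid[i] = CPF(NEXT^i(ss[0]), TAG_SESSION_ID)."""
    k = ss0
    for _ in range(i):
        k = next_key(k)
    return cpf(k, TAG_SESSION_ID)
```

Both endpoints run the same computation on the same inputs, so matching session IDs imply matching inputs unless the stand-in hash collides, which mirrors the first step of the proof.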


3 Integration with TCP 


Integrating tcpcrypt into TCP posed a number of chal- 
lenges ranging from the basic to the baroque. First, we 
have to extend TCP in a backwards compatible way. If a 
tcpcrypt client connects to a tcpcrypt server, encryption 
should be enabled by default, but if it is a legacy server, 
the session must fall back to regular TCP behavior. 

The same issue applies with middleboxes. Tcpcrypt 
must work through NATs, so it cannot protect the TCP 
ports. Tcpcrypt must also work correctly when faced 
with firewalls that do not understand the tcpcrypt exten- 
sions. For an example of how broken firewalls have in- 
hibited innovation, we need look no further than Explicit 
Congestion Notification (ECN). ECN should be harm- 
less to deploy—it uses TCP options in the handshake to 
negotiate the capability, then uses two bits from the old 
IP Type-of-Service field to indicate congestion, and fi- 
nally signals this in feedback using a previously reserved 
TCP flag. ECN is built into all the main modern op- 
erating systems, but is disabled by default. This is be- 
cause a small number of home gateway/firewall boxes 
crash when they see the reserved TCP flag set to one. 

This has taught us to avoid protocol changes to TCP 
that are not carried in TCP options. Firewalls might drop 
unknown options, or might completely drop packets with 
unknown extensions; a TCP extension needs to be robust
to either and correctly fall back to regular TCP behavior.

Finally we risk being hoisted by our own petard. Traf- 
fic normalizers [9], as implemented in pf [10] and some 
other firewalls, enforce conservative rules on protocol 
behavior and consistency. This limits design flexibility. 


3.1 Initial TCP Handshake 


Ideally the key exchange for tcpcrypt would be per- 
formed in TCP’s three-way connection setup handshake, 
as this would add no additional network latency to estab- 
lishing encrypted sessions. We can’t quite achieve this 
for the first connection between two hosts—rather, we 
require adding information to the first four packets of the 
session, as shown in Figure 1. To be backwards
compatible with regular TCP, any data we can add to the SYN
and SYN/ACK packets must fit within the TCP options
field, which is limited to 40 bytes, some of which are
required to negotiate other TCP functionality. This
requires HELLO and PKCONF to be small. HELLO requests
encryption; PKCONF acknowledges the use of encryption
and states the list of public key ciphers that can be used
for the subsequent key exchange. Receipt of a SYN/ACK
without PKCONF causes fallback to vanilla TCP.

The INIT1 message cannot be small, as it must contain 
the client’s public key. The public key cannot fit into an 
option, so instead we re-purpose the data portion of one 
packet in each direction to carry it. The data payload is 
only co-opted in this way after tcpcrypt negotiation has 
succeeded, which ensures that key data never acciden- 
tally gets passed to applications by legacy TCP stacks. 
INIT2 is sent in response to INIT1 in the same way.

We use a single TCP “CRYPT” option; HELLO, 
PKCONF, INIT1, and INIT2 are suboptions of CRYPT.
This reduces the use of scarce TCP option numbers, but 
more importantly it ensures that if a middlebox is go- 
ing to remove one option, it should remove them all. 
If either host receives a TCP segment without a CRYPT 
option during session establishment, tcpcrypt falls back 
to vanilla TCP. This ensures interoperability with non- 
tcpcrypt-aware stacks and middleboxes that strip out un- 
known options. Applications can test whether tcpcrypt 
is used by calling getsockopt to request the session ID, 
which returns an error on downgraded connections. 
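
The fallback rule described above can be summarized as a small decision function. This is a simplified sketch, not the real state machine: it models each segment as the set of CRYPT suboption names it carries (a representation chosen here for illustration), and it ignores session caching and re-keying.

```python
from enum import Enum
from typing import Iterable, Set


class Mode(Enum):
    TCPCRYPT = "tcpcrypt"
    PLAIN_TCP = "plain TCP"


# Suboptions expected, in order, on the first four packets of a new session.
EXPECTED = ["HELLO", "PKCONF", "INIT1", "INIT2"]


def negotiate(segments: Iterable[Set[str]]) -> Mode:
    """Return the resulting mode given the CRYPT suboptions seen per segment.

    An empty set models a segment whose CRYPT option is absent, for example
    because a legacy peer never sent it or a middlebox stripped it.
    """
    seen = list(segments)
    if len(seen) < len(EXPECTED):
        return Mode.PLAIN_TCP
    for expected, subopts in zip(EXPECTED, seen):
        if expected not in subopts:
            return Mode.PLAIN_TCP  # any missing step falls back to vanilla TCP
    return Mode.TCPCRYPT
```

For example, `negotiate([{"HELLO"}, set()])` models a legacy server that answers with a plain SYN/ACK, and yields `Mode.PLAIN_TCP`.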

Tcpcrypt also incorporates a re-keying mechanism,
allowing session keys to evolve later in the connection to
avoid using a single set of session keys for too long.


3.2 Session Caching 


Applications such as the Web often establish more than
one TCP connection between the same pair of hosts in
rapid succession. When they do this, the amount of data
transferred per connection can be quite small, often a
few KBytes. If we have to pay the full cost of running
the public key operations to establish these short-lived
sessions, tcpcrypt can become a bottleneck. Fortunately
we can use the same solution as SSL: cache the
cryptographic state from one TCP connection and use it to
bootstrap subsequent connections.


3 One of us sometimes regrets writing the Normalizer paper.

To do this we use two more CRYPT suboptions, 
NEXTK1 and NEXTK2, also shown in Figure 1. We can- 
not depend on the IP address in the SYN packet to locate 
the correct state because the client may have moved, or a 
different client may have acquired the DHCP lease used 
by a previous client. Thus NEXTK1 contains nine bytes 
of the next session ID, sid[i + 1]. This allows the server
to verify that it has the correct cached state before using 
it to enable encryption. It also makes it hard for DoS at- 
tackers to flush the server’s cache by spoofing packets. 
In the event of a cache miss, the server returns PKCONF 
and the protocol falls back to ordinary key exchange. 
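
A server-side cache lookup along these lines might look as follows. The class, method names, and the shape of the cached state are hypothetical; only the nine-byte session ID prefix, the NEXTK1/NEXTK2 suboptions, and the PKCONF fallback on a miss come from the text.

```python
from typing import Dict, Optional, Tuple

PREFIX_LEN = 9  # NEXTK1 carries nine bytes of the next session ID


class SessionCache:
    """Cached cryptographic state, indexed by a prefix of sid[i + 1]."""

    def __init__(self) -> None:
        self._by_prefix: Dict[bytes, dict] = {}

    def store(self, next_sid: bytes, state: dict) -> None:
        # Index by session ID rather than IP address: the client may move,
        # or another client may reuse the same DHCP lease.
        self._by_prefix[next_sid[:PREFIX_LEN]] = state

    def lookup(self, nextk1_payload: bytes) -> Optional[dict]:
        return self._by_prefix.get(nextk1_payload[:PREFIX_LEN])


def answer_syn(cache: SessionCache, nextk1_payload: bytes) -> Tuple[str, Optional[dict]]:
    """Reply NEXTK2 on a cache hit, otherwise PKCONF (full key exchange)."""
    state = cache.lookup(nextk1_payload)
    if state is None:
        return "PKCONF", None
    return "NEXTK2", state
```

Because a spoofed SYN must carry a valid session ID prefix to touch an entry, an attacker who merely forges source addresses cannot make the server discard or overwrite cached state in this sketch.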


3.3 Protocol and Data Integrity


Unlike SSL, one of tcpcrypt’s goals is to provide in- 
tegrity protection for the TCP session itself, defending 
against attacks that might reset the connection [5], insert 
data into it, or otherwise interfere with its progress [14]. 
To do this, tcpcrypt adds a MAC option to every TCP 
packet after the INIT1/INIT2 exchange. Packets received
with an incorrect or missing MAC are silently dropped. 

This MAC option authenticates a segment’s payload 
as well as a pseudo-header comprising most of the TCP 
header fields and options, as shown in Figure 2. We need 
to be pragmatic about which fields are covered by the 
pseudo-header. The TCP ports cannot be covered, as 
NATs re-write them. The MAC option is zeroed out in 
the pseudo-header, since it cannot authenticate itself. 

Replay attacks could present a potential issue when 
TCP’s sequence space wraps. Instead of sequence and 
acknowledgment numbers, the pseudo-header contains 
implicitly extended 64-bit values that cannot wrap. The 
acknowledgment number is fed separately into the MAC 
value, with a technique from [15], so as to improve the 
efficiency of retransmissions (which often acknowledge 
a different packet from the original). 
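
One common way to extend a 32-bit on-the-wire sequence number to a 64-bit value is to pick the 64-bit candidate closest to the last value seen. The sketch below shows that technique; the text does not say how tcpcrypt performs the extension, so treat this as an assumption rather than a description of the implementation.

```python
WRAP = 1 << 32  # size of TCP's 32-bit sequence space


def extend_seq(last_ext: int, seq32: int) -> int:
    """Return the 64-bit value congruent to seq32 that is closest to last_ext.

    last_ext is the most recent extended sequence number on this connection;
    seq32 is the 32-bit number carried in the TCP header.
    """
    base = last_ext - (last_ext % WRAP)  # last_ext with its low 32 bits cleared
    candidates = (base - WRAP + seq32, base + seq32, base + WRAP + seq32)
    # Negative values cannot occur on a real connection, so discard them.
    return min((c for c in candidates if c >= 0),
               key=lambda c: abs(c - last_ext))
```

With `last_ext = 0xFFFFFFF0`, a header value of `0x10` extends to `0x100000010`: the 32-bit field has wrapped, but the value fed to the MAC keeps increasing, so a packet replayed from before the wrap no longer verifies.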

Extended sequence numbers also solve the problem 
that PAWS [13] was intended to solve, so an encrypted 
TCP session might omit the timestamp option. This frees 
up eight bytes of option space; if we use a 64-bit MAC 
then tcpcrypt will use no more option space than most 
modern TCP implementations. This is particularly rele- 
vant for high performance, because when TCP’s window 
is large it benefits from the robustness provided by Se- 
lective Acknowledgments (SACK) [19], and we do not 
wish to reduce their effectiveness.

Figure 2: A data packet using tcpcrypt. Dashed
quantities are not transmitted by TCP though included in the
MAC, along with shaded fields.


More subtly, we need to be careful about middleboxes 
that modify packets. If an implementation does send the 
timestamp option, tcpcrypt will normalize it to zero in 
the pseudoheader, as OpenBSD’s pf [10] modulates its 
value. All the other options that are commonly modified 
occur only in the SYN or SYN-ACK, so do not present 
a problem. Tcpcrypt does provide a secure timestamp- 
like suboption to CRYPT called SYNC. SYNC is covered 
by the MAC, but fuzzes the clock to avoid the reasons 
for which pf needs to modulate the timestamp’s value. 
Moreover, the SYNC option is only required for keepalive 
packets and during re-keying when the connection is oth- 
erwise idle. In both cases there is no need for SACK 
blocks, so the option space is less precious. 

Packets with the TCP RST bit set present the final 
challenge. For full protection, after session establish- 
ment we would prefer to drop RST packets that do not 
contain a valid MAC option. However, RST is TCP’s 
mechanism for informing one side of a connection that 
the other side no longer has any state for the connec- 
tion. Under such circumstances it is impossible for a 
legitimate host to generate a RST packet with the MAC 
option. Tcpcrypt’s default behavior is to reset the con- 
nection when receiving a RST with no MAC, so long 
as it passes the OS’s sequence number validity checks. 
However, some applications (notably BGP routing) have 
a much stronger requirement to protect against connec- 
tion resets. For these applications we support a set- 
sockopt that mandates RST packets carry a valid MAC. 
Such connections will take a long time to time out if one 
side loses state; however, applications such as BGP and 
SSH that might require such protection also typically use 
application-level keepalives to detect liveness and so tear 
down stale connections. 
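
The RST policy can be written as a small predicate. The parameter names are illustrative, and applying the sequence-number check to MACed RSTs as well is an assumption of this sketch; the text only states the check for RSTs without a MAC.

```python
def accept_rst(has_valid_mac: bool, seq_is_valid: bool, require_mac: bool) -> bool:
    """Decide whether an incoming RST tears down an established session.

    require_mac models the stricter per-socket setting that applications such
    as BGP would request; it is off by default.
    """
    if not seq_is_valid:
        return False          # fails the OS's usual sequence validity checks
    if has_valid_mac:
        return True           # authenticated reset from the real peer
    return not require_mac    # default: honor an unauthenticated RST
```

With `require_mac=True` a peer that has lost its state cannot reset the connection, which is why the text pairs this mode with application-level keepalives.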


3.4 Application Awareness 


Tcpcrypt serves a dual role: for legacy applications 
it protects against passive eavesdroppers; for tcpcrypt- 
aware applications it enables stronger protection, as we 
will discuss below. However, it is important to avoid a 
duplication of functionality. 

Consider a tcpcrypt-aware web browser on a
tcpcrypt-capable host that wishes to make an authenticated
connection to a web server. The browser might prefer
tcpcrypt because of the availability of better password 
authentication methods, but only if the web server also 
supports it. Otherwise, it wishes to fall back to SSL. 

A potential problem occurs when the client connects 
to a legacy web server process running on a tcpcrypt- 
capable host. Under such circumstances we do not wish 
to use both unauthenticated tcpcrypt and authenticated 
SSL encryption, which would be the default behavior. 
Rather, the web browser wishes the tcpcrypt negotiation 
in the SYN exchange to fail unless both the host and the 
web server process can use the tcpcrypt-based authenti- 
cation. 

To get this correct fallback behavior, the HELLO option 
includes a “Mandatory Application-Aware” bit. When
set, this bit indicates to the server that it must not enable 
tcpcrypt encryption unless the server application has in- 
formed the stack that it is tcpcrypt-aware. The process 
uses a setsockopt on the listening socket to do this. 
Our enhanced SSL implementation that uses this mecha- 
nism is described in Section 5.3. 
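
The server-side rule for the Mandatory Application-Aware bit reduces to one boolean condition. The function and parameter names below are hypothetical; they simply restate the rule from the preceding paragraph.

```python
def server_enables_tcpcrypt(hello_mandatory_bit: bool, app_declared_aware: bool) -> bool:
    """Return True if the server stack should answer HELLO with PKCONF.

    hello_mandatory_bit: the Mandatory Application-Aware bit from the client.
    app_declared_aware: whether the listening process told the stack (via a
    socket option) that it is tcpcrypt-aware.
    """
    if hello_mandatory_bit and not app_declared_aware:
        return False  # negotiation fails, so the client can fall back to SSL
    return True
```

A legacy web server on a tcpcrypt-capable host therefore still gets opportunistic encryption from ordinary clients, but not from a browser that set the bit and would otherwise end up running tcpcrypt and SSL together.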

Tcpcrypt also includes a second “Advisory
Application-Aware” bit in both the HELLO and PKCONF
options. Each side uses it to indicate to the
other that its application is tcpcrypt-aware. This is useful
when applications want to perform authentication over
tcpcrypt if the other side is also tcpcrypt-aware, but
where it is not necessary to fall back to an unencrypted
session if the other side is not tcpcrypt-aware. For
example, many websites with low security requirements
use HTTP Digest authentication. Such websites can still
use HTTP Digest authentication over tcpcrypt (though
we would not advise it), but if both the client and server
applications are tcpcrypt-aware, it would be possible
to drop in CMAC-based mutual authentication instead.
However, the client needs to know that the server can do
this before sending the HTTP request, and the “Advisory
Application-Aware” bit provides this information. It
is set via a setsockopt before calling connect and
retrieved at the other side via getsockopt after the
connection handshake completes.


4 Authentication examples 


User authentication is an area in which there exist sim- 
ple and well-known techniques qualitatively superior to 
those in widespread use. For instance, websites typically 
request passwords be sent straight to the server. As a re- 
sult, we see many successful phishing attacks. Almost all 
of these attacks could very easily be defeated with known 
techniques, were it not for issues of backwards
compatibility in protocols and user interfaces. Thus, there are
strong incentives to make improvements to
authentication in the web and other applications.

To realize this shift to better authentication protocols 
we need innovation in user-interface design. Currently, 
HTTP digest authentication, while better than plaintext 
passwords, is seldom used because web developers shun 
browsers’ ugly gray popup boxes. The challenge is to 
allow some aesthetic control by web sites while simulta- 
neously ensuring password entry is unambiguously dif- 
ferentiated from web forms (or anything else accessi- 
ble by JavaScript). Tcpcrypt itself obviously cannot im- 
prove user interfaces; the aim is to ensure that when im- 
provements do happen, they can easily be integrated with 
tcpcrypt to provide security against active attackers. 

The hook tcpcrypt provides to application-level au- 
thentication is the session ID. This section gives a few 
examples of how session IDs can be used, assuming 
the ability to display certificate names and to input 
passwords from a user securely. Though these exam- 
ples require modifications to applications, such enhance- 
ments can be deployed incrementally using tcpcrypt’s 
Application-Aware bits described in the previous section. 

Note that the prevalence of weak authentication makes 
for some very low-hanging fruit. We do not claim these 
obvious and well-known fixes as contributions. Nor do 
we mean to imply that these techniques would not work 
with application-layer traffic encryption were we to en- 
hance SSL. Our point is merely to illustrate the general- 
ity of the session ID abstraction and to help substantiate 
our claim that tcpcrypt provides encryption as a general 
building block suitable for a wide range of applications. 

The key properties we rely on are that 1) if both ends 
of a connection see the same session ID, then the ses- 
sion data’s confidentiality and integrity are ensured, and 
2) session IDs are unique over all time with overwhelm- 
ing probability, even when one end of a connection is 
malicious. 


4.1 Certificate-based authentication 


One common basis for server authentication is cer- 
tificates, such as the X.509 certificates employed by 
SSL. (This model may become even more prevalent if 
DNSSEC gains widespread deployment.) In this model, 
each server S has a long-lived public key, K_S, certified
by a trusted authority to belong to a particular common 
name and organization. The common name or organiza- 
tion can then be presented to the user to inform her of 
whom she is communicating with. 
Certificates permit a trivial authentication protocol: 


S → C: K_S, Certificate, SIGN(K_S^{-1}, Session ID)


The server simply signs the session ID, thereby proving it
owns one end of the connection, ensuring confidentiality
of messages sent by the client and integrity of those sent
by the server. 

The problem with the above protocol is the cost of the 
SIGN function, which can be comparable to public-key 
decryption. The cost for the server to compute such a 
signature for every new client would be comparable to 
setting up an SSL connection, which is one of the fac- 
tors dissuading people from using SSL ubiquitously to- 
day. While there do exist some faster signature schemes 
(e.g., [7]), the certificate authorities may not be willing 
to endorse non-standard algorithms. 

Fortunately, there is a better approach. Heavily loaded 
servers can amortize the cost of a single signature over 
many sessions by signing a batch of session IDs. Session 
IDs are not secret, so disclosing a batch of them to each 
client is not a problem. 
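
A minimal sketch of batch signing follows. The signature primitive is passed in as a callable because the text does not fix one; the HMAC-based stand-in in the usage note is not a public-key signature and is there only to make the example self-contained. The digest layout is likewise an assumption.

```python
import hashlib
from typing import Callable, List, Sequence, Tuple


def batch_digest(session_ids: Sequence[bytes]) -> bytes:
    """Hash a list of session IDs, length-prefixing each for a unique parse."""
    h = hashlib.sha256()
    for sid in session_ids:
        h.update(len(sid).to_bytes(2, "big") + sid)
    return h.digest()


def sign_batch(session_ids: Sequence[bytes],
               sign: Callable[[bytes], bytes]) -> Tuple[List[bytes], bytes]:
    """Produce one signature covering every pending session ID."""
    batch = list(session_ids)
    return batch, sign(batch_digest(batch))


def client_accepts(my_sid: bytes, batch: Sequence[bytes], signature: bytes,
                   verify: Callable[[bytes, bytes], bool]) -> bool:
    """A client accepts only if its own session ID is in the signed batch."""
    return my_sid in batch and verify(batch_digest(batch), signature)
```

Usage with a placeholder primitive: `sign = lambda d: hmac.new(key, d, hashlib.sha256).digest()` and a matching `verify`. In a real deployment `sign` would use the server's certified private key, and every client in the batch would receive the same list and signature.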

Once a single session has been authenticated, the same 
pair of machines can use the existing connection to boot- 
strap authentication of other sessions using only sym- 
metric cryptography. For instance, they can exchange 
a MAC key and use it to authenticate future session IDs. 


4.2 Weak password authentication 


Often two connection endpoints share a secret. For in- 
stance, a user may remember a password, and a server 
may store some secret derived from the password. To- 
day, all too often passwords simply authenticate the user 
to the server and not vice versa. As a basic principle, 
if we deploy new authentication mechanisms, any time 
a user types a password, it should mutually authenticate 
the client and server to each other. There is simply no 
reason ever to use a password to authenticate only one 
endpoint of a communication. Even if the other end is a 
server with an X.509 certificate, the certificate may have 
been fraudulently obtained, or it may be for a “typo” do- 
main name similar enough to the desired one that the user 
doesn’t notice the error. 

When a server, S, is under severe performance
constraints, it can perform password authentication
using symmetric cryptography. For instance, S may
store the secret hash value of a user’s password,
h = H(salt, realm, password); a client C can query S for
the non-secret salt, then compute h from a user-supplied
password. Section 6 benchmarks the following trivial
authentication protocol for such settings:


C → S: MAC(h, TAG_CLIENT || Session ID)
S → C: MAC(h, TAG_SERVER || Session ID)


This protocol is no more costly or hard to implement
than digest authentication [6] (in fact, possibly easier, as
it requires no randomness beyond that already reflected
in the Session ID). Yet it provides better guarantees,
namely mutual authentication of S to C as well as
integrity and confidentiality of all session data. The
protocol assures both C and S that the other end of the
connection knows h. Such a guarantee is different from
and complements that provided by certificates, i.e., that
a server owns a particular domain name. Domain-name
certificates offer important protection in many contexts,
but this session-ID-based protocol offers protection even
when users do not remember the correct domain name.
We note that even if an attacker hijacks DNS to
impersonate S, our protocol is resistant to phishing
for users with good passwords. The protocol can be
viewed as endorsing the session ID with h; since
session IDs are unique over time, the attacker may obtain
MAC(h, TAG_CLIENT || Session ID), but this value is
meaningless in the context of any other connection.
Unfortunately, while the above protocol would be cat- 
egorically superior to plaintext passwords and digest au- 
thentication, we still do not advocate using it except for 
servers on which stronger authentication would require 
too much CPU time. The problem is that an attacker who 
impersonates the server to obtain the first message can 
then mount an offline dictionary attack on the password, 
leveraging the single message exchange to guess arbitrar- 
ily many passwords. Such an attack may be detectable if 
the attacker cannot crack the password in time to mount 
a transparent man-in-the-middle attack—but people are 
used to clicking reload sometimes when web sites fail 
and will not be concerned by a single connection failure. 
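
The two-message weak password protocol of this section can be sketched as follows. HMAC-SHA256 is used here as a stand-in for both H and the MAC (the implementation described later uses CMAC-AES), and the tag values and field encoding are hypothetical.

```python
import hashlib
import hmac

# Hypothetical tag encodings; the text names the tags but not their values.
TAG_CLIENT = b"client:"
TAG_SERVER = b"server:"


def password_hash(salt: bytes, realm: bytes, password: bytes) -> bytes:
    """h = H(salt, realm, password), with length-prefixed fields."""
    data = b"".join(len(x).to_bytes(2, "big") + x for x in (salt, realm, password))
    return hashlib.sha256(data).digest()


def client_message(h: bytes, session_id: bytes) -> bytes:
    """C -> S: MAC(h, TAG_CLIENT || Session ID)."""
    return hmac.new(h, TAG_CLIENT + session_id, hashlib.sha256).digest()


def server_message(h: bytes, session_id: bytes) -> bytes:
    """S -> C: MAC(h, TAG_SERVER || Session ID)."""
    return hmac.new(h, TAG_SERVER + session_id, hashlib.sha256).digest()


def check(expected: bytes, received: bytes) -> bool:
    """Constant-time comparison of a received MAC against the expected one."""
    return hmac.compare_digest(expected, received)
```

Because the session ID is part of the MAC input, a value captured on one connection does not verify on another. The same property is what enables the offline dictionary attack described above: anyone holding a single `client_message` output can test candidate passwords against it without further network traffic.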


4.3 Strong password authentication 


Fortunately, as detailed in Section 6, any site that can
afford to use SSL today can afford to use a strong
password authentication scheme with tcpcrypt. Here we
give a simple example of a Password-Authenticated
Key-Exchange (PAKE) protocol that, while considerably
more expensive than the previous weak protocol, can
nonetheless be implemented with far less overhead than
SSL imposes today.

We use a protocol termed PAKE3 in [4]. The protocol
relies on several system-wide parameter choices: a
group G of prime order q (on which the computational
Diffie-Hellman problem is hard); a generator g of G; two
randomly-chosen elements of G, U and V; two
cryptographic hash functions, H_0 and H_1, mapping strings to
elements of Z_q; and finally, another hash function, H,
onto bit strings the size of a MAC key. At the time a user
registers for an account, her client computes:


π_0 = H_0(password, user name, server name)
π_1 = H_1(password, user name, server name)
L = g^{π_1}


The server stores π_0 and L, but never sees π_1. To
authenticate a session, the client chooses a random
element α ∈ Z_q and the server chooses a random element
β ∈ Z_q. The two then engage in the following protocol:


C → S: g^α U^{π_0}
S → C: g^β V^{π_0}


At this point, both sides compute g^{αβ}. They can do this
by computing either U^{-π_0} or V^{-π_0} and using it to
revert to a regular Diffie-Hellman key exchange. Then
both sides compute g^{π_1 β}. The client can do this because
it knows g^β and π_1. The server can do this because it
knows L = g^{π_1} and β. Finally, both sides compute:


h = H(π_0, g^α, g^β, g^{αβ}, g^{π_1 β})

Using h they complete the password authentication pro- 
tocol of the previous section, but now the order of mes- 
sages doesn’t matter (the client and server can each trans- 
mit one of these messages before receiving the other to 
reduce latency): 


S → C: MAC(h, TAG_SERVER || Session ID)
C → S: MAC(h, TAG_CLIENT || Session ID)


While this protocol is considerably more expensive
than the one in the previous section, it has the benefit of
protecting users with weak passwords; each guess at the
password requires a separate network interaction with a
party that knows either the password or π_0 and L.
Moreover, the protocol is still cheaper than SSL (even
combined with tcpcrypt key negotiation). Therefore, we
believe it is suitable for use in any application that uses both
passwords and SSL.
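
To make the algebra of the strong password exchange concrete, here is a toy run over a tiny multiplicative group. Everything about the parameters is an assumption for illustration: the implementation described in this paper uses an elliptic curve, the group below (order 1019) offers no security, and SHA-256 stands in for the hash functions. The sketch also assumes the final hash takes (π_0, g^α, g^β, g^{αβ}, g^{π_1 β}) as input.

```python
import hashlib
import secrets

# Toy group: P = 2*Q + 1 with P and Q prime; squares mod P have order Q.
P, Q = 2039, 1019
G, U, V = 4, 9, 25  # all squares mod P, so each generates the order-Q subgroup


def hash_to_zq(label: str, password: str, user: str, server: str) -> int:
    data = "|".join((label, password, user, server)).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def register(password: str, user: str, server: str):
    """Return (pi0, pi1, L). The server stores pi0 and L only."""
    pi0 = hash_to_zq("H0", password, user, server)
    pi1 = hash_to_zq("H1", password, user, server)
    return pi0, pi1, pow(G, pi1, P)


def final_key(pi0, g_a, g_b, g_ab, g_pi1b) -> bytes:
    data = ",".join(str(x) for x in (pi0, g_a, g_b, g_ab, g_pi1b)).encode()
    return hashlib.sha256(data).digest()


def exchange(typed_password: str, stored_pi0: int, stored_l: int,
             user: str, server: str):
    """Run both sides locally and return (h_client, h_server)."""
    pi0, pi1, _ = register(typed_password, user, server)  # client recomputes
    a = secrets.randbelow(Q - 1) + 1                      # client's alpha
    b = secrets.randbelow(Q - 1) + 1                      # server's beta
    msg_c = pow(G, a, P) * pow(U, pi0, P) % P             # C -> S
    msg_s = pow(G, b, P) * pow(V, stored_pi0, P) % P      # S -> C
    # Each side strips the mask; x^(Q - e) equals x^(-e) in an order-Q group.
    g_b = msg_s * pow(V, Q - pi0, P) % P
    g_a = msg_c * pow(U, Q - stored_pi0, P) % P
    h_client = final_key(pi0, pow(G, a, P), g_b, pow(g_b, a, P), pow(g_b, pi1, P))
    h_server = final_key(stored_pi0, g_a, pow(G, b, P), pow(g_a, b, P),
                         pow(stored_l, b, P))
    return h_client, h_server
```

When the typed password matches the registered one, both sides derive the same h and the MAC exchange that follows succeeds; otherwise the keys differ and neither MAC verifies.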

It is an open question whether we can design pass- 
word authentication protocols that are highly efficient 
at the server and offload most of the work to the 
client. However, should we devise such protocols, they 
can be deployed after the fact, without modification to 
tcpcrypt itself. The session ID abstraction nicely sepa- 
rates tcpcrypt’s confidentiality and integrity properties, 
which are solved problems, from authentication, where 
further innovation may be needed. 


5 Implementation 


To validate the protocol design and verify its perfor- 
mance, we implemented tcpcrypt in the Linux kernel. 
We also implemented tcpcrypt as a user-space daemon 
using divert sockets; this allows tcpcrypt to be deployed 
easily without requiring any kernel changes. Finally we 
implemented a range of application authentication mech- 
anisms over tcpcrypt. 


5.1 Linux Kernel implementation 


Our kernel implementation of tcpcrypt consists of a 
4,000-line loadable module and 70 lines added to the 
core Linux 2.6.32 kernel to add the necessary hooks. For 
RSA support, we ported OpenSSL v0.9.8l to the Linux
kernel. This required about 400 lines of glue code to
export RSA as a Linux crypto module. We also exposed
OpenSSL’s SHA1 as we found it to perform twice as fast
as Linux’s implementation.

During the implementation, it became clear that 
tcpcrypt is incompatible with TCP segmentation offload- 
ing, as supported in some modern NICs. As tcpcrypt 
has to copy the packet to memory to encrypt the data 
and compute the MAC, segmenting it during this process 
does not add significant overhead. However, a server 
running so close to its performance limits that it re- 
quires segmentation offloading would likely want to dis- 
able tcpcrypt. 


5.2 Portable userspace implementation 


Our userspace tcpcrypt implementation uses divert sock- 
ets to access TCP packets entering and leaving the host. 
Firewall rules select the packets to be diverted, leaving 
the kernel unchanged. FreeBSD’s NAT (natd) is im- 
plemented this way. The main advantages of this ap- 
proach are portability and ease of deployment. Our code 
is 7,000 lines. We have tested it on MacOS X, FreeBSD 
and Linux. 

The userspace implementation is obviously slower 
than the native kernel implementation, but it is ideal for 
early deployment without support from OS vendors. If 
tcpcrypt is successful and ships in major operating sys- 
tems, it will still be a long time before older hosts are up- 
graded. The userspace implementation provides a good 
interim solution. It can also be run on middleboxes such 
as firewalls or home gateways to protect traffic to and 
from legacy local hosts against passive eavesdropping. 

The userspace implementation is more complicated 
than the kernel one as it must track connections, dupli- 
cate much of TCP’s state machine, calculate checksums 
again, and rewrite sequence and acknowledgment num- 
bers since we use some bytes of the payload for INIT 
messages. In SYNs the MSS is reduced to allow space 
to add the MAC to subsequent packets. In addition, the 
sending of application data must be delayed until the 
tcpcrypt handshake completes, which we do by modulat- 
ing the receive window. Finally, we implement IPC calls 
to provide the equivalent of getsockopt, so the applica- 
tion can extract the session ID to perform authentication. 


5.3 Integrating tcpcrypt and OpenSSL


If tcpcrypt were enabled by default, then an SSL con- 
nection between two tcpcrypt hosts would duplicate ef- 
fort doing both tcpcrypt and SSL key exchange and en- 
cryption. Tcpcrypt’s Mandatory Application-Aware bit 
avoids this duplication. To verify this mechanism and to 
compare the full performance of Apache running SSL- 
over-tcpcrypt using batch-signing to that of vanilla SSL, 
we implemented tcpcrypt support within the OpenSSL
v0.9.8l library. We did not modify OpenSSL’s API or
require applications to set specific parameters to gain the
benefits of tcpcrypt and batch-signing; our library is a
drop-in replacement for OpenSSL.


Our implementation uses the tcpcrypt setsockopt to 
notify the kernel that the application supports tcpcrypt, 
setting the Mandatory Application-Aware bit during the 
handshake. After the TCP handshake, either the session 
is encrypted and both sides support tcpcrypt-based au- 
thentication, or the connection has fallen back to vanilla 
TCP. The library code then queries with getsockopt to 
get the session ID. If this returns an error, it falls back to 
SSL’s handshake, otherwise it batch-signs the session ID 
and sends it to the client. 


We modified OpenSSL’s BIO layer to call the neces- 
sary setsockopt for setting the application bit. The 
SSL layer, i.e., SSL_accept and SSL_connect, then
deals with the signatures. Thus, so long as the appli- 
cation uses the BIO API, no change to the application 
is needed to use tcpcrypt-based authentication instead of 
SSL authentication. 


Things are not quite so clean if application program- 
mers manually create sockets using the BSD socket APIs 
instead of BIO, feeding them directly into SSL_accept 
and SSL_connect. These sockets will not have the nec- 
essary options set, and so tcpcrypt would disable itself 
even though the SSL library is capable. In such cases, if 
upgrading the application is not possible, then a sysctl 
could be used to set the application bit on by default on 
specific TCP ports. 


Batch signing is implemented per SSL context. A 
single worker thread (per SSL context) waits on a 
semaphore for work and batch signs all session IDs it 
finds on its work queue. The signer thread then wakes 
up all threads corresponding to the session IDs signed. 
For batch signing to work, the SSL server must be mul- 
tithreaded. We note that this implementation naturally 
scales depending on load: if a single client needs a sig- 
nature, it is produced right away; when under load, mul- 
tiple client session IDs will be batch signed to amortize 
cost. Our OpenSSL patch and batch signing code total 
700 lines of code. 


5.4 Password-based authentication


We implemented the weak password authentication 
scheme in Section 4.2 as well as the strong scheme from 
Section 4.3. The weak scheme uses CMAC-AES as 
the MAC, and employs IBM’s CMAC patch [21] for 
OpenSSL. We implemented the strong authentication 
scheme ourselves (500 lines of code) using OpenSSL’s 
built-in support for NIST Prime-Curve P-256.


6 Performance and compatibility


If we are to achieve our ultimate goal of encrypting al- 
most all Internet traffic, then the cost of doing so must 
be sufficiently low that the cost/benefit trade-off makes 
sense, even when the benefits are small. What then are 
the costs of running tcpcrypt? Roughly, the performance 
cost breaks down as follows: 


• The cost of the tcpcrypt key exchange. 


• The cost of encrypting and MACing every packet 
on the wire. 


• The cost of authentication over tcpcrypt, for 
applications that choose to authenticate. 


We must demonstrate that the first two are small enough 
they will not significantly degrade the performance of 
the vast majority of servers (clients are rarely the bottle- 
neck, as they handle only a few connections per second 
at most). We must also demonstrate that the third is at 
least as cheap as current deployed solutions. 

In addition, we must also demonstrate compatibility. 
Tcpcrypt must not cause connections to fail that would 
succeed without tcpcrypt. 


6.1 Connection setup rate 


Just how fast do servers need to accept connections in 
practice? It is hard to get firm numbers. YouTube gets 
1 billion hits per day [12], thus averaging about 11,500 
hits per second. Facebook currently gets about 260 bil- 
lion page views per month [20], or around 100,000 per 
second. Of course a page may require more than one 
TCP connection, but with HTTP/1.1 the number will be 
fairly small. Facebook also has over 30,000 servers [24]. 
Not all these are front-end servers, but even so it becomes 
clear that the number of connections that need to be han- 
dled per second on each server is unlikely to be more 
than a few thousand. 
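
The back-of-the-envelope arithmetic above can be reproduced with a few lines of Python. The 30-day month and the even spread of load across all 30,000 servers are simplifying assumptions of this sketch, not figures from the cited sources.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(total_events, days):
    """Average event rate over a period of the given number of days."""
    return total_events / (days * SECONDS_PER_DAY)


youtube_hits = per_second(1_000_000_000, days=1)        # 1 billion hits/day
facebook_views = per_second(260_000_000_000, days=30)   # 260 billion views/month

# If the load were spread evenly over every server (an assumption; only
# front-end servers actually terminate client connections).
facebook_per_server = facebook_views / 30_000

print(f"YouTube:  {youtube_hits:,.0f} hits/s")
print(f"Facebook: {facebook_views:,.0f} page views/s")
print(f"Facebook: {facebook_per_server:.1f} page views/s per server")
```

Even if only a small fraction of the servers are front ends and each page view needs several connections, the per-server rate stays in the low thousands.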

To get another perspective, we can examine what an 
untuned operating system running an untuned web server 
can achieve. This tells us how default configurations per- 
form, and so what a typical server administrator might 
expect. Our test machines are eight-core (two Intel Xeon 
X5355 CPUs) running Linux 2.6.32. Each has 13 1Gb/s 
NICs connected to client hosts via a LAN. Multiple 
clients and parallel connections are needed to saturate the 
server. Untuned, these servers can handle 35,500 TCP 
connections per second in a simple connection setup and 
teardown test, or 28,400 connections per second running 
Apache serving a small static file. 

To determine the effect of tcpcrypt, first we need a 
better control experiment because the untuned numbers 
above, although typical of most real-world installations, 
fail to fully utilize the machine, leaving some idle time. It 
took considerable tuning⁴ to get the connection setup and 
teardown test to saturate all the cores. Such a setup is not 
realistic for normal operation, but we wish to compare 
against the best-case vanilla TCP, not one that leaves un- 
used CPU cycles. We will compare this optimized TCP 
against SSL and tcpcrypt. 
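
The per-process part of the tuning described in the footnote (one benchmark instance per CPU) can be sketched as below. The helper name `pin_to_cpu` is hypothetical; `os.sched_setaffinity` is a Linux-only call, so the sketch returns `None` elsewhere. NIC interrupt affinity is configured separately by the administrator and is not shown.

```python
import os


def pin_to_cpu(cpu_index):
    """Pin the calling process to one CPU; return that CPU, or None."""
    if not hasattr(os, "sched_setaffinity"):    # only available on Linux
        return None
    allowed = sorted(os.sched_getaffinity(0))   # CPUs we may run on
    cpu = allowed[cpu_index % len(allowed)]
    os.sched_setaffinity(0, {cpu})              # 0 means "this process"
    return cpu
```

Each benchmark instance would call `pin_to_cpu(i)` with its own index `i` before opening its listening port.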

We expect tcpcrypt to slow down TCP’s connection 
throughput in two main ways. First, uncached tcpcrypt 
connections use public key operations to set up a connection. 
This cost is predominantly borne by clients, which 
perform the more expensive RSA decryption operation. 
We use 2048-bit RSA-3 keys in all benchmarks. 

Second, packets are MACed and thus require more 
CPU cycles and memory accesses. Even with connec- 
tion caching, which avoids the need for public key ci- 
pher operations, four out of six of the packets in an 
accept/close cycle are MACed (two ACKs and two 
FINs). We therefore expect a performance degrada- 
tion both in the uncached and cached connection cases, 
though uncached connections will be more expensive. 

We expect SSL to perform less well than tcpcrypt for 
two reasons. First, it requires more RTTs to complete a 
connection because SSL’s handshake can only start after 
TCP’s handshake. More notably, uncached SSL connec- 
tions should be much slower than tcpcrypt’s because an 
SSL server performs the more expensive RSA decryption 
operation. However SSL also authenticates the server, so 
this is not an apples-to-apples comparison. We shall ex- 
amine the cost of tcpcrypt’s authentication in Section 6.2. 


                              Connection rate (conn/s) 
Protocol                      Native        Divert 
TCP                           98,434        [illegible] 
tcpcrypt server (cached)      [illegible]   [illegible] 
tcpcrypt server (uncached)    27,070        21,908 
SSL server (cached)           [illegible]   [illegible] 
SSL server (uncached)         [illegible]   [illegible] 
tcpcrypt client (uncached)    [illegible]   [illegible] 


Table 1: Connection setup rate of tcpcrypt. 






Table 1 shows the results. Both the cached case (same 
client reconnecting) and uncached case (new client, re- 
quiring public key cipher operations) are shown. The 
two columns benchmark our two tcpcrypt implementations: 
the kernel one (“Native”) and the userspace divert 
socket one. To get divert numbers for TCP and SSL, 


⁴ This involved running multiple instances of the benchmark on different 
ports to avoid kernel locks on accept. We set the affinity of 
each benchmark to one CPU, and used a different NIC per benchmark, 
with the NIC’s interrupt affinity set to the same CPU as the benchmark 
using the NIC. This resulted in optimal packet scheduling and load 
balancing that finally brought the system to zero idle time. 
we divert all traffic to userspace and back to the ker- 
nel; although this isn’t useful, it allows us to separate 
out the different costs and see the overhead of the divert 
socket separately from additional protocol mechanism in 
the tcpcrypt userspace implementation. 

Tcpcrypt outperforms SSL in the uncached case by a 
large margin due to reversing the asymmetric RSA costs; 
the client bears this cost. Tcpcrypt’s cached performance 
is also better than SSL. We note that our kernel im- 
plementation is not fully optimized, so it may well be 
possible to get even greater performance. For example, 
we could encrypt and MAC data while copying it from 
userspace rather than doing it on a later pass. This would 
be an optimization similar to that for checksum calcula- 
tion already used in Linux. 

While TCP can be up to 41% faster for cached con- 
nections and 3.6x faster for uncached ones, we believe 
that the absolute performance numbers of tcpcrypt have 
their own merit. Recall that a heavily loaded website 
like YouTube averages 11,500 connections per second 
and tcpcrypt should be able to sustain such high load. 
Also recall that our untuned default configuration server 
can only handle 35,500 connections per second running the same 
benchmark of Table 1, also a target that tcpcrypt can meet 
if some sessions are cached. 
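
To make "if some sessions are cached" concrete, the sketch below estimates what fraction of connections would have to hit the session cache for the native implementation to reach a target rate. It rests on stated assumptions: the native TCP rate is read as 98,434 conn/s and the uncached tcpcrypt rate as 27,070 conn/s from Table 1, the cached rate is derived from the "41% faster" figure rather than read from the table, and the server is modeled as spending 1/rate seconds of capacity per connection.

```python
TCP_RATE = 98_434                 # conn/s, native TCP (Table 1)
UNCACHED_RATE = 27_070            # conn/s, native tcpcrypt, uncached (Table 1)
CACHED_RATE = TCP_RATE / 1.41     # derived from "TCP up to 41% faster"


def cached_fraction_needed(target_rate):
    """Smallest cached fraction that sustains target_rate (clamped to 0..1).

    A result of 1.0 can also mean the target exceeds the cached rate.
    """
    per_uncached = 1 / UNCACHED_RATE    # seconds of capacity per connection
    per_cached = 1 / CACHED_RATE
    per_target = 1 / target_rate
    fraction = (per_uncached - per_target) / (per_uncached - per_cached)
    return min(1.0, max(0.0, fraction))


print(f"TCP vs uncached tcpcrypt: {TCP_RATE / UNCACHED_RATE:.1f}x")
print(f"cached share for 11,500 conn/s: {cached_fraction_needed(11_500):.0%}")
print(f"cached share for 35,500 conn/s: {cached_fraction_needed(35_500):.0%}")
```

Under this simple model no caching at all is needed to sustain YouTube's average rate, while matching the untuned TCP server requires somewhat under half of the connections to be cached.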

The divert socket implementation is slower than our 
kernel one due to the multiple copies needed for each 
packet, from the kernel to userspace, and then back to 
the kernel. Furthermore the userspace implementation 
needs to (wastefully) duplicate TCP functionality already 
present in the kernel such as checksum calculations and 
protocol control block lookups. However, we believe that 
the absolute performance numbers of our divert socket 
implementation are sufficient for many situations, espe- 
cially on clients, where simple installation may be a pri- 
ority over performance. 


6.2 Authenticated connection setup rate 


While Table 1 included SSL as a reference point, it can- 
not be used to directly compare the two systems be- 
cause SSL performs authentication by default and thus 
is stronger than unauthenticated tcpcrypt. As tcpcrypt 
leaves authentication to applications, we are free to ex- 
amine different authentication schemes. Our authentica- 
tion benchmarks cover: tcpcrypt with batch signing (SSL 
replacement), CMAC password-based mutual authenti- 
cation (vulnerable to offline dictionary attacks), batch 
signing combined with CMAC, and PAKE password-based 
mutual authentication (both resistant to offline dictionary 
attacks). The benchmark and setup are identical 
to our previous benchmark, but with added application- 
level authentication after connection setup. We expect 
tcpcrypt with batch signing to outperform SSL when 
Figure 3: tcpcrypt’s authenticated connection setup rate. 


batching more than one request, as RSA signatures will 
be amortized. We expect CMAC to outperform RSA- 
based authentication, because it uses symmetric cryptog- 
raphy only. Our PAKE implementation is so far unop- 
timized, but even so we expect it to be faster than RSA 
because it replaces the expensive RSA signature with a 
few elliptic-curve operations. 
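
The amortization argument can be written as a small cost model. The function name and both timing parameters below are hypothetical values chosen only to show the shape of the curve; they are not measurements from our benchmarks.

```python
def signed_connections_per_second(sign_cost_s, per_conn_cost_s, batch_size):
    """Connections per second per core when one signature covers a batch."""
    cost_per_connection = per_conn_cost_s + sign_cost_s / batch_size
    return 1.0 / cost_per_connection


# Hypothetical costs: 2 ms per signature, 0.1 ms of other work per connection.
for batch in (1, 10, 100, 1000):
    rate = signed_connections_per_second(0.002, 0.0001, batch)
    print(f"batch={batch:5d}  {rate:8.0f} conn/s")
```

As the batch grows, the signature term shrinks toward zero and the rate approaches the limit set by the per-connection work alone, so most of the gain arrives at moderate batch sizes.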

Figure 3 shows tcpcrypt’s authenticated connection 
setup rate when using our kernel implementation (“Native”) 
and our userspace divert socket one. Batch signing 
performs differently depending on the size of the batch 
and Figure 3(b) shows how this scales. Most of the ben- 
efits of batch signing arise even with a parameter as low 
as 100, a number of concurrent clients easily reached 
when the server is under load. Figure 3(a) clearly shows 
that there is a range of performance characteristics which 
applications may choose from. With SSL instead, ap- 
plications are forced to use relatively low performance 
one-way authentication. Clearly, one size does not fit all. 
With tcpcrypt, applications can choose any combination 
of one-way or two-way authentication and higher perfor- 
mance at lower security or lower performance at higher 
security. For example, a busy web forum might choose 
CMAC for its authentication as it requires two-way au- 
thentication and high performance, but perhaps is not so 
security-critical that it needs to thwart offline dictionary 
Table 2: Connection setup time. 





attacks. This setup would perform 36x faster than SSL 
on uncached connections, providing stronger (two-way) 
authentication. A bank, by contrast, might choose PAKE for 
its authentication, performing slower, but still twice as 
fast as SSL. Alternatively if a certificate is available, 
signing plus CMAC could be 24x faster than SSL and 
still resist offline dictionary attacks. A site requiring only 
one-way authentication, like a checkout from an online 
shop that does not require login, can perform up to 26x 
faster than SSL when loaded and handling over 150 con- 
current requests. Tcpcrypt with batch signing is therefore 
a viable drop-in replacement for SSL, as in all cases its 
connection setup performance is superior (we shall ex- 
amine data throughput in Section 6.4). Authentication 
adds little cost to tcpcrypt: 2% penalty with CMAC or 
28% with batch signing under load. We believe this per- 
formance to be practical for many servers. 

For most clients the performance of the divert socket 
implementation will be sufficient, providing an easily 
installed alternative. 

Hardware is often used to offload expensive public key 
cryptography. For example, Sun’s UltraSPARC T1 has 
a Modular Arithmetic Unit for RSA, and can do 2,300 
2048-bit signatures per second using all 32 cores [18]. 
Tcpcrypt outperforms this using only eight general pur- 
pose cores, showing how careful protocol design can 
avoid the need to throw hardware (and money) at the 
problem. We argue that offloading asymmetric encryp- 
tion is no longer needed for network encryption. 


6.3 Connection latency 


Throughput is not the only important metric— 
connection setup latency is also important. We 
compare the connection setup time from the client’s 
point of view for TCP, SSL and tcpcrypt. We expect 
tcpcrypt to set up connections faster than SSL because 
tcpcrypt’s handshake requires fewer round trips. Table 2 
shows the time to establish a connection on a LAN 
(0.2ms RTT) and on a WAN (100ms RTT). 


When the connection is cached, tcpcrypt adds very 
little delay to TCP because no extra RTTs are needed. 
Tcpcrypt does extra work to advance keys and MAC 
the ACK, hence it takes fractionally longer. SSL 
cached takes considerably longer because its negotiation 
can only start after TCP’s handshake finishes whereas 
tcpcrypt uses the three-way handshake. In the non- 
cached case tcpcrypt and SSL perform similarly on the 
LAN as RSA dominates the cost. The main difference 
is that tcpcrypt is client-limited whereas SSL is server- 
limited. On the WAN, RTT dominates; tcpcrypt costs 
one RTT more than TCP, but one RTT less than SSL as 
it needs fewer messages to complete the handshake. Au- 
thenticating an uncached tcpcrypt connection, for exam- 
ple using CMAC or PAKE, adds extra latency. 
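
When RTT dominates, the comparison above reduces to counting round trips. The sketch below assumes a plain TCP connect costs one RTT from the client's point of view and that a cached SSL handshake adds one further RTT; the remaining counts follow from the text (cached tcpcrypt adds none, uncached tcpcrypt is one more than TCP and one less than SSL). Cryptographic processing time is left as an optional parameter.

```python
# Round trips before the client can use the connection (partly assumed).
ROUND_TRIPS = {
    "TCP": 1,
    "tcpcrypt (cached)": 1,
    "tcpcrypt (uncached)": 2,
    "SSL (cached)": 2,
    "SSL (uncached)": 3,
}


def setup_time_ms(protocol, rtt_ms, crypto_ms=0.0):
    """Approximate connection setup time: round trips plus crypto work."""
    return ROUND_TRIPS[protocol] * rtt_ms + crypto_ms


for name in ROUND_TRIPS:
    print(f"{name:22s} LAN {setup_time_ms(name, 0.2):5.1f} ms"
          f"   WAN {setup_time_ms(name, 100):5.0f} ms")
```

On the LAN the round-trip term is negligible and the `crypto_ms` term decides the outcome, which is why the uncached tcpcrypt and SSL cases look alike there.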

With batch-signing there might be a concern that the 
queuing of requests to be signed might add extra latency. 
In fact this is not the case—our implementation signs 
whatever queue is available as fast as it can. Even the fact 
that tcpcrypt with signing requires two RSA operations 
does not add to latency—the expensive decrypt operation 
on the client takes place in parallel with the sign opera- 
tion on the server, so negligible extra latency is required 
beyond the extra RTT needed for authentication. 

The main effect of batch signing is in fact to reduce la- 
tency as the server becomes loaded. This is shown in Fig- 
ure 4, which graphs connection latency against the num- 
ber of connections per second the server handles. As the 
load increases eventually the server saturates and the la- 
tency increases extremely rapidly. The figure shows SSL 
latency and tcpcrypt latency when the maximum batch 
size has been artificially limited to 1, 5 and 10. SSL and 
tcpcrypt with a batch size of one are indistinguishable on 
this graph, so we only plot one line. It is clear that when 
the server has CPU cycles to spare, the batch size has 
no adverse effect on latency. In fact, quite the reverse— 
batching reduces the variance (the plot shows 10th and 
90th percentiles as error bars), because short-term variations 
in arrival rate map into variation in batch size rather 
than variation in CPU load. More importantly, allowing 
larger batch sizes allows the server to saturate much later, 
and so maintain this low latency across a much wider 
range of server workloads. 


6.4 Data transfer rates 


We now account for the cost of symmetric encryption 
and determine the maximum data throughput one can ex- 
pect with tcpcrypt. We benchmark data throughput when 
transmitted with TCP, tcpcrypt and SSL. To fully satu- 
rate the CPU we ran one benchmark program per core 
and NIC pair, setting the affinity of the benchmark and 
NIC to a particular core. Otherwise, packet scheduling 
was suboptimal resulting in idle time. We expect SSL 
and tcpcrypt performance to be similar as both are doing 
AES128 and HMAC-SHA1. Obviously, vanilla TCP 
will be fastest as it need not encrypt or MAC. 


Figure 4: Latency as connection rate increases. 


                     Transfer throughput (Mbit/s) 
Protocol             Native    Divert 
TCP                  12,954    5,357 
tcpcrypt AES-SHA1    3,968     [illegible] 
SSL AES-SHA1         3,692     1,939 


Table 3: tcpcrypt’s data throughput. 

Table 3 shows the data throughput of tcpcrypt, for our 
kernel implementation (“Native”) and our userspace divert 
socket one. We were unable to saturate the CPU 
on the TCP benchmark (11% idle time) as we saturated 
all available NICs on the server. Tcpcrypt outperforms 
SSL by 7.4%. This was unexpected as the two essentially 
perform the same tasks: AES and SHA1. We are 
using different implementations for AES (Linux’s ker- 
nel vs. OpenSSL) though we found the two to perform 
similarly when benchmarked individually. The funda- 
mental differences between tcpcrypt and SSL are that 
SSL must do its own data segmentation and encapsula- 
tion (in addition to TCP’s) thus needs more work than 
tcpcrypt. SSL MACs at a message boundary which 
can span multiple packets, whereas tcpcrypt must MAC 
once per packet. Tcpcrypt is MACing slightly more data 
as it includes packet headers, though the cost of SSL’s 
message encapsulation seems to outweigh the additional 
bytes MACed by tcpcrypt. Overall, however, CPUs are 
powerful enough to fully encrypt a one Gigabit link, and 
in fact even more. Client machines seldom have more ca- 
pacity than that, and even our userspace implementation 
provides sufficient performance for those cases. 
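
The difference in MAC granularity can be illustrated by counting MAC computations over the same payload. This is a simplified sketch: the 1,460-byte segment size is a typical Ethernet value assumed here, 16,384 bytes is the maximum SSL/TLS record payload, and the packet headers that tcpcrypt also covers are ignored.

```python
import hashlib
import hmac


def mac_units(payload, unit_size, key):
    """HMAC-SHA1 each unit_size-byte slice of payload; return the tags."""
    return [
        hmac.new(key, payload[i:i + unit_size], hashlib.sha1).digest()
        for i in range(0, len(payload), unit_size)
    ]


payload = bytes(64 * 1024)                            # 64 KiB of data
per_packet = mac_units(payload, 1460, b"demo-key")    # tcpcrypt-style
per_record = mac_units(payload, 16384, b"demo-key")   # SSL-style
print(len(per_packet), "MACs per packet vs", len(per_record), "per record")
```

Per-packet MACing performs many more MAC invocations, yet the measurements above suggest that SSL's extra segmentation and encapsulation cost more than those additional invocations.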

Most relevant to servers, higher rates are possible 
by using faster ciphers and MACs; tcpcrypt achieves 
7,486Mbit/s using Salsa20/12 and UMAC. High-speed 
AES is possible too now that AES-enhanced CPUs are 
becoming ubiquitous, like Intel’s Westmere CPU [8], 
Sun’s UltraSPARC T2 [2] and VIA’s processors [1]. On 
a dual-core 3.33GHz desktop i5 with a 10Gb/s NIC, 
                     Apache, static content (req/s) 
Protocol             Native    Divert 
TCP                  60,156    27,196 
tcpcrypt (cached)    42,440    20,034 
SSL (cached)         19,787    12,063 


Table 4: Apache performance serving static content. 





tcpcrypt performed 8,835Mbit/s using AES-UMAC, 
even without TCP segmentation offloading and optimiza- 
tions in tcpcrypt. As an experiment, we were able to 
saturate 10Gb/s by using jumbo frames or by overclock- 
ing the box to 3.75GHz. We thus soon expect CPUs that 
will permit 10Gig AES networking; in fact, this is likely 
possible today if a six-core server i5 is used. 


6.5 Application performance: Apache 


We now study the overhead of tcpcrypt when used in a 
real application. We test the Apache web server (v2.2.11) 
serving a 44 byte static file. This setup has low ap- 
plication overhead, emphasizing overhead imposed by 
the networking stack. With a default configuration, our 
server can handle 28,400 requests per second though the 
CPUs remain unsaturated. To fully saturate CPUs, we 
must run multiple Apache instances, each on a different 
TCP port, serving traffic on a different NIC. Based on 
our microbenchmarks, we expect tcpcrypt to outperform 
SSL and have lower performance than TCP. We do not 
perform any authentication on this tcpcrypt benchmark, 
so SSL provides stronger guarantees in this case. However, 
as discussed earlier, authentication can be added to 
tcpcrypt at a relatively low cost if needed. 

Table 4 shows the results of our Apache benchmark. 
Because real-world web traffic is a mix of new and re- 
turning clients, connection setup can quickly become a 
bottleneck for SSL. Tcpcrypt, on the other hand, main- 
tains a high connection rate (31% of native TCP) even 
for new clients. Note also that the case of small, static 
files is a worst-case benchmark for connection setup. We 
tried benchmarking WordPress, a more CPU-intensive 
application. Neither tcpcrypt nor SSL caused a measurable 
slowdown. This test demonstrates that ubiquitous 
encryption is feasible when the application is the bottle- 
neck, and in most cases even if it is not. 


6.6 Compatibility 


Incremental deployment is one of our chief goals. Es- 
sentially this entails gracefully falling back to TCP so 
that connections are guaranteed to succeed. Users will 
not enable tcpcrypt if doing so causes their connections to 
fail. Tcpcrypt falls back gracefully so long as packets 
with the CRYPT option do not get dropped. Otherwise, 
tcpcrypt might indefinitely send SYN packets with the 
CRYPT option, and the connection would fail when it 
would succeed using a virgin SYN packet. To gauge 
whether this is a problem, we initiated tcpcrypt connec- 
tions to the top 10,000 sites listed on Alexa. Specifically, 
we sent a SYN with the CRYPT-HELLO option set, ex- 
pecting to get a SYN-ACK back. If not, we considered 
the packet dropped. We retransmitted SYNs to detect 
packet loss. This gives a rough estimate of how many 
connections would fail because of tcpcrypt. 

Of the Alexa top 10,000 sites, we found 15 (0.15%) 
that did not respond with a SYN-ACK to a tcpcrypt SYN. 
Of these, three were on the same network. Given such a 
low failure rate, we are optimistic that tcpcrypt will work 
most of the time and can be safely deployed. However, 
by default, tcpcrypt will try to revert to standard TCP in 
case it does not receive a SYN-ACK after sending a few 
tcpcrypt SYNs to ensure reachability. 
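
The fallback rule can be sketched as follows. The callable `send_syn` is a hypothetical stand-in that returns `True` when a SYN-ACK arrives, and the default of three attempts is an assumption standing in for "a few"; neither is taken from our implementation.

```python
def connect_with_fallback(send_syn, crypt_attempts=3):
    """Try SYNs carrying the CRYPT option first, then a plain SYN.

    Returns "tcpcrypt", "tcp", or "failed".
    """
    for _ in range(crypt_attempts):
        if send_syn(crypt=True):      # SYN-ACK received for a CRYPT SYN
            return "tcpcrypt"
    if send_syn(crypt=False):         # revert to standard TCP
        return "tcp"
    return "failed"


def path_dropping_crypt_option(crypt):
    """Simulated path that silently drops SYNs carrying the CRYPT option."""
    return not crypt


print(connect_with_fallback(path_dropping_crypt_option))
```

In this simplified model a host behind such a path still connects, at the cost of the retransmission delay spent on the unanswered CRYPT SYNs.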

We do not expect tcpcrypt to suffer ECN’s fate in 
terms of compatibility. ECN used reserved bits in the 
TCP header which would trigger IDSs and cause unde- 
fined behavior. Instead, tcpcrypt uses options as dic- 
tated by TCP’s specification and is not anomalous in 
any way—for instance, even during re-keying the proto- 
col design ensures that retransmissions always produce 
the same payload bytes for a given range of sequence 
numbers. We thus believe that tcpcrypt can safely be 
deployed on today’s Internet as it will, for the majority 
of users, provide stronger security without breaking con- 
nections or noticeably reducing performance. 


7 Related work 


We categorize related work based on the networking 
stack layer it operates in. The network layer is domi- 
nated by IPSec-based solutions. IPSec [16] encrypts all 
data above the network layer. However, IPSec has not 
enjoyed widespread deployment and use, so a reasonable 
fear is that tcpcrypt could endure the same fate. Fortu- 
nately, several factors make it easier to deploy tcpcrypt 
and provide greater incentive to do so, leaving us some 
hope that ubiquitous encryption can succeed at the trans- 
port layer even if it has not at the network layer. 

A big challenge to IPSec is that it breaks middleboxes 
that require access to the transport layer. Given the in- 
creasing prevalence of NAT in particular, this excludes 
a large portion of the population from using IPSec. 
Tcpcrypt, by contrast, operates at the transport layer and 
so avoids these problems. Another challenge for IPSec 
is that it is hard to create a notion of a “session” in a 
connection-less environment (the network layer). Thus, 
while IPSec is good at authenticating hosts to one an- 
other for purposes such as virtual private networks, it 
would be difficult and cumbersome to authenticate indi- 
vidual users, processes, and connections between hosts. 
Moreover, some transport-level security issues, such as 
protecting against wrapped acknowledgment numbers, 
are harder to reason about in IPSec. 

Conversely, there are several incentives for deploying 
tcpcrypt that have no analogue with IPSec. One is that it 
can be integrated in a backwards-compatible way with 
SSL and significantly increase performance. By con- 
trast, SSL over IPSec would require double-encryption, 
reducing performance. Second, TCP multipath requires a 
means of authenticating the same endpoint with multiple 
IP addresses, which tcpcrypt makes much easier. That 
said, tcpcrypt is less general than IPSec, which encrypts 
everything above IP, including UDP. 

Better Than Nothing Security (BTNS) [26] is IPSec 
without a PKI, thus providing no security guarantees 
against active attackers. This is similar to default 
tcpcrypt. However, tcpcrypt additionally exposes the 
necessary hooks so that applications can perform authentication 
in a variety of ways to guarantee security. 
Opportunistic encryption using IKE [23] specifies how 
to use IPSec with certificates obtained from DNSSEC. 
Tcpcrypt would need application support to integrate 
with DNSSEC. 

We found no privacy solutions integrated into the 
transport layer. There are, however, integrity solutions. 
TCP MD5 [11] and AO [27] provide authentication and 
integrity protection within TCP. Tcpcrypt provides more 
functionality than these options by providing encryption. 
Moreover, tcpcrypt is fundamentally different as it re- 
quires no user setup. The session is established using 
ephemeral keys, and authentication can happen over the 
session itself. TCP MD5 and AO require establishing 
pre-shared secrets through out-of-band means. The main 
use of TCP MD5 and AO is to protect manually con- 
figured BGP sessions, which tcpcrypt can do as well by 
disabling unauthenticated RST packets. Also, TCP AO 
does not interoperate with NATs (which is okay for its in- 
tended use, as BGP is not usually spoken through NATs). 

The dominant encryption solution above the transport 
layer is SSL [22]. Tcpcrypt offers a number of bene- 
fits over SSL, including better server performance, in- 
trinsic forward secrecy, and integrity protection for the 
TCP session itself. Tcpcrypt is also more general, as it 
supports arbitrary authentication mechanisms and does 
not require a PKI. Finally, tcpcrypt is backward com- 
patible with legacy applications and legacy hosts, which 
should ease ubiquitous deployment. Being more general, 
tcpcrypt can be used as a drop-in replacement for SSL, 
and we have in fact produced an SSL library that falls 
back to SSL if tcpcrypt is unavailable. 

ObsTCP [17] also aims to provide opportunistic en- 
cryption, but is only designed to provide security in ag- 
gregate, not for specifically targeted connections. The 
author states, “We continue to advocate TLS as the only 
user facing transport security,” meaning ObsTCP will 
duplicate encryption done by TLS, not protect transport 
headers, and not integrate with application-level authen- 
tication. ObsTCP requires no new TCP options and no 
extra round trips for connection setup, but the downside 
is that applications must be modified and that the first 
connection between two hosts remains unencrypted un- 
less one knows that the other supports ObsTCP. 

While tcpcrypt combines only well-known techniques, 
no other existing protocol can accomplish all of its 
goals. Specifically, tcpcrypt can be incrementally de- 
ployed on today’s Internet, works out-of-the-box (even 
through NATs) without manual configuration, provides 
high enough performance to be on by default, and allows 
applications to integrate transport-layer security with ar- 
bitrary higher-level authentication techniques. The Inter- 
net demands higher security, hardware is ready for it, and 
the cryptographic techniques were waiting to be pieced 
together; tcpcrypt does so, and we believe our evaluation 
shows it could be readily deployed. 


8 Conclusion 


Tcpcrypt demonstrates that ubiquitous encryption of 
TCP traffic is technically feasible on modern hardware. 
By leveraging the asymmetry of common public key ciphers, 
it is possible for a server to accept and service 
around 20,000 tcpcrypt connections per second without 
session caching. Even higher rates are possible with 
caching. Data transfer rates are not an issue either; AES-SHA1 
encryption and integrity protection can be done at 
several gigabits per second without hardware support on 
2008-era hardware. The newest Intel CPUs incorporat- 
ing AES instructions are even faster—tcpcrypt can reach 
9Gb/s using AES-UMAC on a dual-core i5 desktop, suggesting 
that six-core i5 servers should handle 10Gb/s. 
These results suggest that tcpcrypt should have a neg- 
ligible impact on the vast majority of applications. 

The main contribution of this work is not performance, 
though this is a prerequisite. There are no new crypto- 
graphic primitives, nor is the protocol especially novel. 
The main contribution is from putting well-understood 
components together in the right way to permit rapid and 
universal deployment of opportunistic encryption, and 
then providing the right hooks to encourage innovation 
and deployment of much better and more appropriate 
application-level authentication. This ability to integrate 
transport-layer security with application-level authenti- 
cation largely obviates the need for applications to en- 
crypt their own network traffic, thereby minimizing du- 
plication of functionality. 
As an example, we showed how a simple batch- 
signing server-authentication scheme can leverage 
tcpcrypt to provide forward secrecy and the same se- 
curity as SSL while handling 25 times the connections 
per second. At the same time, the protocol allows an 
SSL server to fall back gracefully to regular SSL behav- 
ior when one or the other side cannot utilize tcpcrypt for 
authentication. 

We also demonstrated the use of tcpcrypt to bootstrap 
both weak and strong password-based mutual authentica- 
tion (using CMAC and PAKE respectively). Password- 
based authentication without mutual authentication, even 
over SSL, really should be a thing of the past. Using 
tcpcrypt with batch signing and CMAC mutual authenti- 
cation is strictly stronger than HTTP Digest authentica- 
tion over an SSL session, and more than 20 times faster. 
Using tcpcrypt and our unoptimized PAKE implementa- 
tion is almost twice as fast as SSL, and provides stronger 
security. Many other authentication mechanisms are pos- 
sible; we believe that tcpcrypt’s generality and simple 
application-level hooks are exactly what is required to 
get application writers to think about the form of authen- 
tication they really need, once they can address authen- 
tication separately from the question of how to encrypt 
session data. 

Finally, tcpcrypt interoperates seamlessly with legacy 
applications, TCP stacks, and middleboxes, making it 
easy to deploy incrementally. For all of the above 
reasons, we believe that it now makes sense to make 
transport-layer encryption the default. Make it happen 
by installing tcpcrypt from http://tcpcrypt.org. 


Acknowledgments 


We thank Alan Eustace, Daniel Giffin, Eric Grosse, Brad 
Karp, Adam Langley, the anonymous reviewers, and our 
shepherd, Nikita Borisov, for information, suggestions, 
feedback, and other assistance. This work was funded 
by gifts from Intel (to Brad Karp) and from Google, by 
NSF awards CNS-0716806 (A Clean-Slate Infrastructure 
for Information Flow Control) and CCR-0331542 (PORTIA), 
and by the EU FP7 Trilogy project. 


References 


[1] VIA Padlock Security Engine. 


[2] Ultra-FAST Cryptography on the Sun UltraSPARC T2. 
http://blogs.sun.com/bmseer/entry/ultra_fast_ 
cryptography_on_the. 


[3] BITTAU, A., HANDLEY, M., AND LACKEY, J. The final nail in 
WEP’s coffin. In SP ’06: Proceedings of the 2006 IEEE Sym- 
posium on Security and Privacy (Washington, DC, USA, 2006), 
IEEE Computer Society, pp. 386—400. 


[4] BONEH, D., AND SHOUP, V. A graduate course in applied cryptography. 
Version 0.1, from http://cryptobook.net, 2008. 


19th USENIX Security Symposium 


[5] FEDERAL COMMUNICATIONS COMMISSION. Commission 
orders Comcast to end discriminatory network management 
practices. http://hraunfoss.fcc.gov/edocs_public/attachmatch/DOC-284286A1.pdf. 


[6] FRANKS, J., HALLAM-BAKER, P., HOSTETLER, J., 
LAWRENCE, S., LEACH, P., LUOTONEN, A., AND STEWART, L. 
HTTP authentication: Basic and digest access 
authentication. RFC 2617, 1999. 


[7] GRANBOULAN, L. How to repair ESIGN. In Security in Computer 
Networks (2003), vol. 2576 of LNCS, pp. 234-240. 


[8] GUERON, S. Intel Advanced Encryption Standard (AES) Instructions 
Set. Intel White Paper, Rev 03. 


[9] HANDLEY, M., PAXSON, V., AND KREIBICH, C. Network intrusion 
detection: evasion, traffic normalization, and end-to-end 
protocol semantics. In SSYM’01: Proceedings of the 10th conference 
on USENIX Security Symposium (Berkeley, CA, USA, 
2001), USENIX Association, pp. 9-9. 


[10] HANSTEEN, P. N. M. The Book of PF - A No-Nonsense Guide 
to the OpenBSD Firewall. No Starch Press, 2007. 


[11] HEFFERNAN, A. Protection of BGP Sessions via the TCP MD5 
Signature Option. RFC 2385, 1998. 


[12] HURLEY, C. Y,000,000,000utube. The Official YouTube 
Blog, http://youtube-global.blogspot.com/2009/10/y000000000utube.html. 


[13] JACOBSON, V., BRADEN, R., AND BORMAN, D. TCP extensions 
for high performance. RFC 1323, 1992. 


[14] JONCHERAY, L. A simple active attack against TCP. In SSYM’95: 
Proceedings of the 5th conference on USENIX UNIX Security 
Symposium (Berkeley, CA, USA, 1995), USENIX Association. 


[15] KATZ, J., AND LINDELL, A. Y. Aggregate message authentication 
codes. In Topics in Cryptology - CT-RSA (2008). 


[16] KENT, S., AND ATKINSON, R. Security Architecture for the 
Internet Protocol. RFC 2401, 1998. 


[17] LANGLEY, A. Obfuscated TCP. 
http://code.google.com/p/obstcp/wiki/Transcript. 


[18] LIN, C.-C. RSA Performance of Sun Fire T2000. 
http://blogs.sun.com/chichang1/entry/rsa_performance_of_sun_fire. 


[19] MATHIS, M., MAHDAVI, J., FLOYD, S., AND ROMANOW, A. 
TCP selective acknowledgement options. RFC 2018, 1996. 


[20] MCCARTHY, C. Pingdom: Facebook is killing it on page 
views. CNET News, http://news.cnet.com/8301-13577_3-10428394-36.html. 


[21] PETER WALTENBERG. AES-GCM, AES-CCM, CMAC updated 
for OpenSSL 1.0 beta 2, revised. 


[22] RESCORLA, E. SSL and TLS: Designing and Building Secure 
Systems. Addison-Wesley Professional, 2000. 


[23] RICHARDSON, M., AND REDELMEIER, D. Opportunistic Encryption 
using the Internet Key Exchange (IKE). RFC 4322 (Informational), 
December 2005. 


[24] ROTHSCHILD, J. High performance at massive scale - lessons 
learned at Facebook. Seminar at UCSD. 


[25] SCHILLACE, S. Default HTTP access for Gmail. 
http://gmailblog.blogspot.com/2010/01/default-https-access-for-gmail.html. 


[26] TOUCH, J., BLACK, D., AND WANG, Y. Problem and Applicability 
Statement for Better-Than-Nothing Security (BTNS). RFC 
5387 (Informational), November 2008. 


[27] TOUCH, J., MANKIN, A., AND BONICA, R. The TCP authentication 
option. Internet draft (work in progress), July 2009. 


USENIX Association 


Automatic Generation of Remediation Procedures for Malware Infections 


Roberto Paleari, Lorenzo Martignoni, Emanuele Passerini,
Drew Davidson, Matt Fredrikson, Jon Giffin, Somesh Jha


Università degli Studi di Milano
{roberto, ema}@security.dico.unimi.it


University of Wisconsin
{davidson, mfredrik, jha}@cs.wisc.edu


Abstract 


Despite the widespread deployment of malware-detection
software, in many situations it is difficult to
preemptively block a malicious program from infecting
a system. Rather, signatures for detection are usually
available only after malware has started to infect a large
group of systems. Ideally, infected systems should be
reinstalled from scratch. However, due to the high cost
of reinstallation, users may prefer to rely on the
remediation capabilities of malware detectors to revert the
effects of an infection. Unfortunately, current malware
detectors perform this task poorly, leaving users’ systems
in an unsafe or unstable state. This paper presents an
architecture to automatically generate remediation
procedures from malicious programs: procedures that can
be used to remediate all and only the effects of the
malware’s execution in any infected system. We have
implemented a prototype of this architecture and used it to
generate remediation procedures for a corpus of more than
200 malware binaries. Our evaluation demonstrates that
the algorithm outperforms the remediation capabilities of
top-rated commercial malware detectors.


1 Introduction 


One of the most pressing problems faced by the
Internet community today is the widespread diffusion of
malware. To defend against malware, users rely on
signature- or behavior-based anti-malware software that
attempts to detect and prevent malware from damaging
an end-host. Unfortunately, in many cases detection and
prevention are not possible. Malware authors have
perfected the practice of automatically creating a large
number of variants, or malware that appears new to detectors
but exhibits the same behavior when executed. For new
malware and variants, signatures for detection are rarely
available by the time malware reaches a network, leaving
a time window in which systems are susceptible to
infection. In these situations, the ability to detect and remove
the malware after infection is not enough: it is also
imperative that any harmful changes to the system made by
the malware are remediated (or reverted).


The safest way to remediate a system is to format 
the permanent storage and re-install the operating sys- 
tem from scratch. While effective, this approach is also 
costly and usually results in a loss of valuable personal 
data, particularly when data backups are incomplete or 
non-existent. Rather, end-users and administrators may 
prefer to remove only those resources left behind by the 
malware, leaving the rest of the system intact. Unfor- 
tunately, current anti-malware products perform poorly 
at this task. A recent study demonstrated that even top- 
rated commercial anti-malware software fails to revert 
the effects of all the actions performed by malware dur- 
ing infections [15]. Needless to say, partially-remediated 
systems are unstable and prone to error. 


In this paper, we present a system that automatically 
generates remediation procedures from malware bina- 
ries. These remediation procedures can be executed on 
infected systems to restore the state to a clean config- 
uration, and are capable of remediating the effects of a 
malware sample a posteriori, without observing the in- 
fection take place. The fact that our remediation proce- 
dures are generated to cover a particular malware binary, 
rather than a specific sequence of system events resulting 
in an infected state [7, 19], amounts to a substantial break 
from previous technologies. Using our system, one can 
generate a single general-purpose executable that is ca- 
pable of reversing the effects of a malware sample on an 
arbitrary number of hosts after the fact. In other words, 
one does not need to be aware of our system, or make 
use of it, until after the infection takes place. To achieve
this goal, we rely on a combination of dynamic program
analysis and semantic generalization to produce models
of infection behavior that are resilient to common
malware anti-analysis techniques, such as the use of
nondeterministic file names or the omission of malicious
behavior on some runs of the program. Then, we translate these
behavior models directly into executable procedures that
remediate the effects of a malware infection.

We have implemented our ideas in a prototype tool. 
Using the prototype, we automatically generated reme- 
diation procedures on a corpus of more than 200 binary 
malware samples belonging to approximately 50 distinct 
families. We evaluated the practical effectiveness of each 
procedure by testing its ability to recognize all of the 
harmful effects of a malware execution (true positives) 
while leaving benign aspects of the system intact (true 
negatives). The results of our evaluation attest to the ef- 
fectiveness of our technique: in total, we reversed 98% 
of the harmful effects while generating only a single false 
positive, although we were not able to remediate user- 
specific resource changes such as deleted documents and 
personal file mutations. In contrast, the best commercial 
anti-malware product remediated only 82% of the effects 
of our corpus. 

In summary, we make the following contributions: 


• We present an architecture to automatically generate
remediation procedures given binary malware
samples. To the best of our knowledge, our architecture
is the first to work under the assumption that
information relating to a specific infection is not
available; rather, characteristic infection patterns are
observed and generalized to produce effective
procedures in this setting.


• We evaluated an implementation of our framework
on more than 200 real malware samples and
found that it was able to remediate the resulting
infections more effectively than existing commercial
antivirus products. We have made this implementation
available as an open-source package.¹


The rest of this paper is organized as follows: In Sec- 
tion 2 we discuss related work. In Section 3.1, we de- 
scribe the problem that our architecture solves by pre- 
senting a realistic example, and in Section 3.2 we outline 
our approach, relating it to the example. In Section 4, 
we formalize the problem of malware remediation and 
present the technical details of our approach. In Sec- 
tion 5, we evaluate the effectiveness of our approach by 
testing a prototype implementation against real malware. 
In Section 6 we discuss the limitations of our approach, 
the security implications, and potential avenues for fu- 
ture work. We present concluding remarks in Section 7. 


2 Related Work 


Our contributions relate to ongoing research on
behavior-based malware analysis, on the execution of untrusted
applications in trusted systems, and on the automatic
generation of signatures to detect malicious network
traffic.


¹ The URL for this tool is http://www.cs.wisc.edu/~mfredrik/remediate


Behavior-Based Malware Analysis: The prevalence 
of packed, polymorphic, and metamorphic malware 
highlights the deficiencies of traditional detection ap- 
proaches based on syntactic signatures. This has urged 
researchers and security practitioners to focus on solu- 
tions that base policy on the behavior exhibited by un- 
trusted software. Behavior-based techniques attempt to 
infer security-relevant information about an untrusted 
program either by analyzing it statically [16] or by ob- 
serving its operation dynamically [1, 11, 21]. The major 
drawback of current behavior-based techniques is their 
high computational overhead. Recently, Kolbitsch et al. 
developed an efficient analysis solution intended to re- 
place traditional anti-malware on the desktop [8]. Closer 
in spirit to the work presented here is that of Christodor- 
escu et al. [4]. They described an automatic approach 
that derives formal specifications of malicious behavior 
by comparing the observed dynamic behavior of mali- 
cious and benign applications. Their technique uses de- 
pendence graphs, which express the relationships among 
various low-level behavior events, and is similar in many 
ways to our high-level behavior abstraction component 
(see Section 4.2.1). Another area of much recent activity 
is that of automatic classification of malware into fami- 
lies [2, 18]. For that type of work, malware is grouped 
into clusters, which correspond to families, by some no- 
tion of behavioral similarity. Our technique uses a form 
of behavioral grouping as a means to remediate a system, 
but we go further than malware classification by attempt- 
ing to remove the harmful effects of the malware on the 
system. 


Execution of Untrusted Applications: In addition to 
work that attempts to detect or prevent the execution of 
malicious software, some work has been done to miti- 
gate the harmful effects of software a posteriori. Hsu 
et al. presented a framework for automatically repairing 
an infected system after monitoring the execution of the 
malware [7]. The actual work of remediating a system 
given a detailed description of the malicious execution 
is similar to the way that we construct remediation pro- 
cedures from generalized behavior models. Liang et al. 
described an alternative approach called Alcatraz [10]. 
In Alcatraz, an untrusted application is executed inside 
of a sandbox, and any change it makes is not commit- 
ted until the program is confirmed to be innocuous. The 
manner in which a program is deemed innocuous is
considered orthogonal to the main issue of sandboxing. The
idea was later tweaked [19] so that all state changes made
by an application are cached, and upon program
termination the user decides whether or not to keep any changes.
The primary difference between these techniques and
the one presented in this paper is that they rely on
information regarding specific execution traces, whereas
our remediation procedures use generalized notions of
the behavior of a malware instance. As such, our
system can remediate harmful effects of malware, including
some effects that were not observed in a trace.


Automatic Signature Generation: The generalized 
behavior models that we use to construct executable re- 
mediation procedures can be viewed as generic signa- 
tures relating the effects of a malicious program on sys- 
tem resources. Different approaches have been proposed 
for automatically generating attack signatures. Poly- 
graph [14] is one of the first systems proposed by re- 
searchers to address the problem of generating network 
signatures to detect polymorphic worms. Polygraph 
identifies invariant fragments of packets that are found in 
all the network flows generated by the same worm, since 
they are necessary for the worm to successfully exploit a 
given vulnerability. These fragments are then combined 
into signatures using different techniques. Hamsa [9] 
addresses the same problem using a different algorithm 
that identifies and combines invariants. Hamsa’s
signatures are more accurate and more resilient against
attacks than Polygraph’s. Finally, Nemean [20]
generates semantics-aware signatures to detect network
intrusions. Nemean’s methodology, consisting of high-level
network traffic abstraction, clustering, and generalization 
using automata learning, is similar to ours. However, 
we operate on a fundamentally different domain than 
Nemean, which generates signatures of network packet 
traces. 


3 Overview 


In this section, we motivate our work using a realistic 
example of a malware infection and present our architec- 
ture by walking through the steps that it takes to remedi- 
ate the example. 


3.1 Motivation 


Consider the malware whose pseudo-code is shown
in Figure 1. This program generates a random
filename located in the system directory, drops a
malicious payload into the file, creates a new registry value
that causes the payload to be executed at system boot
time, tampers with the system’s network name resolver
(c:\...\etc\hosts), and infects a benign system
library (c:\windows\user32.dll). Our goal is to
generate a procedure that remediates infections caused by
any possible execution of this code. In this case,
recovery includes: (1) deleting the file containing a copy
of the malicious payload, (2) deleting the registry key
created to start the malware at boot, (3) disinfecting
c:\windows\user32.dll, and (4) restoring the original
configuration of the name resolver c:\...\etc\hosts. It
is important that the effects of all malicious actions taken
by the malware are removed. For example, consider what
happens when (1), (2), and (3) are remediated, but not
(4). In this case, all Internet traffic on the host remains
subject to hijacking by the malware, so the system is still
in a dangerous configuration. Many commercial
products would leave the system in this configuration [15].


Completely remediating the effects of the malware in
Figure 1 is not as straightforward as the example might
suggest. First, high-level source code is usually not
available when dealing with real malware. Given the
well-known difficulty of statically analysing adversarial
binary code [13], this means we must partially rely on
dynamic information. Although this example does not
illustrate it, there is a possibility that the malware
contains paths that are rarely executed under normal
circumstances. Any harmful effects produced on such a path
would be difficult to account for in a remediation
procedure, because the problem of discovering such an
effect dynamically is extremely difficult. Second,
malware can appear to be nondeterministic by relying on
subtle details in its environment, such as the system clock
or pseudorandom number generator. This behavior is
often present even on common paths, and is apparent in
our example, despite its simplicity: both the filename of
the malicious payload and the name of the registry value
used to activate the payload depend on randomness.


Given the limited nature of dynamic program
information, it may be hard to generate a remediation procedure
that precisely accounts for all of the nondeterminism in
a program. Procedures that generalize too aggressively may
mistakenly identify benign system resources as malicious
and attempt to remediate them. Consider a remediation
procedure that attempts to account for the nondeterminism
in our example by looking for all files in the system
directory with the suffix .exe. While this policy would
effectively capture the nondeterminism in the payload
filename, any attempt to remediate resources based on it
would result in the unacceptable removal of benign
executables. Conversely, procedures that do not attempt to
generalize execution behavior are likely to miss some
malicious effects that must be remediated. For example, after
running the sample malware once, we might find that the
payload is delivered in c:\windows\poqwz.exe. If a
remediation procedure does not generalize this information
and only ever looks for this file when remediating
infections caused by other executions of this malware, then it
will miss the payload file most of the time, as it is not
possible to observe the malware long enough to see all
possible variants of the payload file name.


 1  // generate random file and value names
 2  filename = "po" + random_alpha() + random_alpha() + random_alpha() + ".exe";
 3  valuename = (random_int() % 2) ? "qv" : "vq";
 4  ...
 5  // drop malicious code
 6  f = CreateFile("c:\windows\" + filename, GENERIC_WRITE, ...);
 7  WriteFile(f, malicious_buf, ...);
 8  WriteFile(f, other_malicious_buf, ...);
 9  ...
10  // start the newly created executable at boot
11  RegOpenKey(HKEY_LOCAL_MACHINE, "...\Windows\CurrentVersion\Run", &r);
12  if (RegQueryValueEx(r, valuename, NULL, REG_SZ, ...) == ERROR_FILE_NOT_FOUND)
13      RegSetKeyValue(r, valuename, REG_SZ, filename, ...);
14  ...
15  // infect user32.dll
16  g = CreateFile("c:\windows\user32.dll", FILE_APPEND_DATA, ..., OPEN_EXISTING, ...);
17  WriteFile(g, malicious_buf, ...);
18  ...
19  // hijack HTTP connections to www.google.com and www.citibank.com
20  h = CreateFile("c:\windows\system32\drivers\etc\hosts", ..., OPEN_EXISTING, ...);
21  ReadFile(h, buf, ...);
22  WriteFile(h, "67.42.10.3 www.google.com\n67.42.10.3 www.citibank.com", ...);
23  ...
24  // delete main executable
25  DeleteFile("c:\malware.exe");


Figure 1: Pseudo-code of a sample malicious program.


3.2 Architecture Overview 


The architecture we have developed for generating re- 
mediation procedures from malware binaries is shown in 
Figure 2. It has three primary components: (1) an execu- 
tion monitor that infers the malware’s high-level behav- 
iors from a low-level trace, (2) a component that gener- 
alizes the high-level behaviors from multiple executions 
of the malware, and (3) a component that produces exe- 
cutable remediation procedures from generalized behav- 
iors. The entire system works sequentially, with each 
component using the information produced by the one 
preceding it. 


High-Level Behavior Extraction: The high-level be- 
havior extraction component (numbered 1 in Figure 2) 
analyzes the semantics of a program to produce a se- 
quence of meaningful behaviors relevant to remediation. 
Because malware authors usually obfuscate their bina- 
ries, we rely on dynamic information to infer these be- 
haviors; we execute binaries in a special environment (an 
emulator) to extract a low-level execution trace, perform 
analysis using manually constructed rules, and arrive at a 
high-level trace [11]. Table 1 lists the high-level behav- 
iors we consider. Each behavior modifies the state of the 
system in some way and is parameterized by a set of ar- 
guments that determine which aspects of the system state 
are affected. The behaviors currently listed were constructed
manually to reflect the salient behavioral features of most
malware: they commonly occur in malware and are needed
to infect a system.
However, our technique can be extended to operate over
a wider set of high-level behaviors.


The environment in which a program runs typically
affects its behavior, and malware often exhibits a certain
degree of nondeterminism. To account for these factors,
we collect several high-level behavior traces for each
sample. To do so, we vary the environment by changing
factors that malware typically relies on, such as
locale, service pack level, and so forth. Although not
supported by our current implementation, path exploration
techniques [12] can be applied in this component to
account for a more complete subset of the malware’s
behavior, as in Bouncer [5]. The lack of path exploration
is not a fundamental limitation of our system: such
techniques can be easily plugged in.


Our high-level behavior extractor would infer that
the sample malware from Figure 1 demonstrates the
FileCreation, RegistryCreation, DropAndAutostart,
and FileInfection behaviors, with different arguments
for FileCreation, RegistryCreation, and
DropAndAutostart on each execution.


Behavior Generalization: After producing a set of
high-level behavior traces for a malware sample, we
attempt to account for nondeterminism by creating a
general, abstract model of behavior that accounts for all of
the concrete traces we observed (numbered 2 in
Figure 2). Note that generalization attempts to
overapproximate existing paths, thus encompassing future paths,
rather than explore as many new paths as possible. In
effect, this patches some of the incompleteness of dynamic
analysis by extrapolating observed information to future,
unseen executions of the malware. This is accomplished
by recognizing when distinct behaviors from multiple
high-level traces, with possibly different arguments, are
actually instances of the same malicious activity. We
refer to this matching of behaviors as clustering. When a
cluster is identified, the arguments of its constituent
behaviors are generalized to tolerate any differences that
may be present in the actual values. Thus,
nondeterminism is accounted for via overapproximation by ensuring
that this generalization extends to future, unseen
executions.


Figure 2: Architecture of the system for generating remediation procedures. In this figure, S denotes a system call
trace, 6 denotes a high-level behavior trace inferred from a system call trace, C denotes a cluster, and C denotes a
generalized cluster.

In the malware from Figure 1, our technique would
cluster all instances of the same high-level behavior
together. For example, all instances of DropAndAutostart
would be clustered together and all instances of
FileInfection would be clustered together. Because
there is likely variation among the arguments of
DropAndAutostart, we construct a regular expression
to tolerate minor differences while ensuring that
benign files are not mistakenly identified. The final
result of the computation for this behavior would be a
DropAndAutostart behavior with generic file argument
c:\windows\po[[:alpha:]]{3}.exe to generalize the
random filename at line 2, generic registry key
...\CurrentVersion\Run for the registry key touched at line
11, and (qv|vq) for the registry value name randomly
generated at line 3.
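

The argument generalization described above can be illustrated with a
small Python sketch. The helper generalize below is a hypothetical
stand-in rather than the algorithm used by the system: it assumes that
all observed names share a fixed prefix and suffix and that the varying
middle part is alphabetic and of constant length. Python's re module has
no POSIX [[:alpha:]] class, so [A-Za-z] is used in its place.

```python
import os
import re


def generalize(names):
    """Build a regex covering all observed names (illustrative assumption only).

    Keeps the common prefix and suffix literally and replaces the varying
    middle with an alphabetic class of the observed length.
    """
    prefix = os.path.commonprefix(names)
    # Common suffix = reversed common prefix of the reversed strings.
    suffix = os.path.commonprefix([n[::-1] for n in names])[::-1]
    middles = [n[len(prefix):len(n) - len(suffix)] for n in names]
    lengths = {len(m) for m in middles}
    if len(lengths) != 1 or not all(m.isalpha() for m in middles):
        raise ValueError("sketch only handles fixed-length alphabetic middles")
    return re.escape(prefix) + "[A-Za-z]{%d}" % lengths.pop() + re.escape(suffix)


# Payload names as they might be observed on three runs (made-up values).
observed = [
    r"c:\windows\poqwz.exe",
    r"c:\windows\poabc.exe",
    r"c:\windows\pozzt.exe",
]
pattern = re.compile(generalize(observed), re.IGNORECASE)
```

With these three observations the pattern accepts an unseen variant such
as c:\windows\poxyz.exe while rejecting c:\windows\notepad.exe. If the
observed names happened to agree on more characters than the malware
actually fixes, this naive approach would under-generalize, which is one
reason several traces are needed.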


Remediation Procedure Generation: The third
component of our architecture (numbered 3 in Figure 2)
generates executable remediation procedures from the
generalized behaviors produced in the previous step. The
resulting procedure examines the state of the system on
which it runs in search of symptoms of an infection, and
removes the symptoms whenever possible. It attempts
to match each resource (file, process, or registry key) on
the system against the constraints associated with each
generalized high-level behavior. For our running
example, each file is matched against the regular expression
c:\windows\po[[:alpha:]]{3}.exe associated with the
first argument of the DropAndAutostart behavior,
another regular expression associated with the second
argument, and a final one describing the content of the file.
If such a file is found, then the registry values under the
key ...\CurrentVersion\Run are matched against the
regular expression (qv|vq). If such a value is found and
its data matches the current filename being considered,
then all of the resources (the file and the registry value)
are removed. Currently, we only produce remediation
procedures that operate on system files. For technical
reasons explained in Section 4, we do not handle
user-specific files and resources. While this is a limitation
of our current approach, we hope to remove it in future
work.
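

The matching logic for the running example can be sketched as follows.
This is a minimal illustration under stated assumptions, not the
generated procedure itself: the system state is a pair of in-memory
dictionaries (files and run_values are hypothetical stand-ins for the
file system and the ...\CurrentVersion\Run key), the check on file
content is omitted, and [[:alpha:]] is written as [A-Za-z] because
Python's re module lacks POSIX classes.

```python
import re

# Generalized DropAndAutostart model from the running example.
FILE_RE = re.compile(r"c:\\windows\\(po[A-Za-z]{3}\.exe)", re.IGNORECASE)
VALUE_RE = re.compile(r"(qv|vq)")


def remediate(files, run_values):
    """Remove matching file/registry-value pairs from a toy system state.

    `files` maps paths to contents; `run_values` maps value names under the
    Run key to their data. Returns the list of removed (path, value) pairs.
    """
    removed = []
    for path in list(files):
        m = FILE_RE.fullmatch(path)
        if not m:
            continue
        for name in list(run_values):
            # The registry data must name the file currently being considered.
            if VALUE_RE.fullmatch(name) and run_values[name] == m.group(1):
                del files[path]
                del run_values[name]
                removed.append((path, name))
                break
    return removed


# Toy infected state: one dropped payload plus benign resources.
files = {
    r"c:\windows\poqwz.exe": b"payload",
    r"c:\windows\notepad.exe": b"benign",
}
run_values = {"vq": "poqwz.exe", "updater": "update.exe"}
removed = remediate(files, run_values)
```

Requiring the file and the registry value to match jointly is what keeps
a benign file that merely looks like po???.exe from being deleted on its
own.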


4 Generating Remediation Procedures 


In this section, we present the details of our sys- 
tem for generating remediation procedures. We begin 
by formalizing the problem solved by our system and 
continue component-by-component describing the algo- 
rithms used to solve the problem. 


4.1 Problem Description 


When malware runs on a system, it may infect the system 
by changing its persistent state in an undesirable way. 
Category            Behavior           Arguments                   Description

Resource creation   FileCreation       File name and content       Creation of a new file
                    RegistryCreation   Key name and content        Creation of a new registry value
                    DropAndAutostart   File name and content.      Creation of a new file and of a registry
                                       Key name and content        value containing its name (to execute
                                                                   the file automatically at every boot)
                    DropAndExecute     File and process name       Creation and execution of a new executable

Resource infection  FileInfection      File name and content.      Infection of an existing file
                                       List of preserved regions
                    RegistryInfection  Key name and content        Replacement of an existing registry value

Resource deletion   FileDeletion       File name                   Deletion of an existing file
                    RegistryDeletion   Key name                    Deletion of an existing registry value


Table 1: High-level behaviors considered for remediation.


For our purposes, the state $S$ of a system is modeled
as an association from resource names $N$ to data from
a domain $D$. Individual elements of $S$ are referred to
as resources. To simplify notation, we let $\mathcal{S}$ stand for
the set of possible system states. Because most malware
is written for Windows platforms, our targeted resource
namespace consists of Windows filenames, registry key
and value names, and process names. The data domain
is the set of all finite-length bit strings.


The infection behavior of a malware can be understood
as a transition relation between system states.
There are three ways in which the malware can modify
the state of a system: (1) resources may be completely
removed from the system, (2) new resources may
be added to the system, and (3) the data corresponding
to existing resources may be mutated. Because the
infection behavior of a malware can be succinctly described
in terms of these three operations and the resources over
which they operate, we represent it using an infection
relation $R \subseteq \mathcal{S} \times N \times \mathcal{S} \times \mathcal{S}$ that encodes this
information. Intuitively, the infection relation describes the way
in which a particular malware changes the state of a
system. Given an element $(S, N_{rem}, S_{add}, S_{mut}) \in R$, the
malware transforms state $S$ into a new state by removing
the resources labeled by $N_{rem}$, adding the resources in
$S_{add}$, and modifying the resources in $S_{mut}$. Note that the
infection behavior is described as a relation rather than a
function. This is because malware may behave
nondeterministically when it infects a system: it may infect
the same system state in different
ways on two distinct executions.
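

To make the three operations concrete, the following Python sketch
applies one element of an infection relation to a system state. It is an
illustration under simple assumptions: a state is a dictionary from
resource names to byte strings, and the function name infect and the
sample resources are hypothetical.

```python
def infect(state, n_rem, s_add, s_mut):
    """Apply one element (S, N_rem, S_add, S_mut) of the relation to state S.

    `state`, `s_add` and `s_mut` map resource names to data; `n_rem` is a
    set of resource names. The input state is left unmodified.
    """
    new_state = dict(state)
    for name in n_rem:                 # (1) resources removed
        new_state.pop(name, None)
    new_state.update(s_add)            # (2) resources added
    for name, data in s_mut.items():   # (3) existing resources mutated
        if name in new_state:
            new_state[name] = data
    return new_state


clean = {
    r"c:\windows\user32.dll": b"original",
    r"c:\malware.exe": b"dropper",
}
mutation = {r"c:\windows\user32.dll": b"original+infection"}

# Two executions on the same state: only the random payload name differs.
run_a = infect(clean, {r"c:\malware.exe"}, {r"c:\windows\poqwz.exe": b"payload"}, mutation)
run_b = infect(clean, {r"c:\malware.exe"}, {r"c:\windows\poabc.exe": b"payload"}, mutation)
```

Because run_a and run_b start from the same state yet differ, the
behavior cannot be captured by a function from states to states, which
is why a relation is used.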


After a given piece of malware has infected a system,
the goal of remediation is to undo the effects of the
infection, returning the system to a clean state. More
precisely, given a malware binary, we seek to construct an
infection relation for that malware that describes its
behavior. We can then use the information in the infection
relation to enact changes on the system that remediate
the effects of the malware: restoring any files that were
removed ($N_{rem}$) or mutated ($S_{mut}$), and removing files
that were added ($S_{add}$). We package this functionality
as an executable remediation procedure, as described in
Section 3.2. In general, there are a number of approaches
that may realize the goal of constructing the infection
relation corresponding to a given malware. In this paper,
we focus on applying dynamic analysis to the malware
sample to extract the information necessary to construct
the infection relation.


In practice, it is not usually possible to reconstruct the
true infection relation from a malware binary. Rather, we
compute a relation that overapproximates the actual
behavior for a finite set of execution paths exhibited by the
malware. For example, we overapproximate the resource
names involved in the DropAndAutostart behavior
of Figure 1 by creating a regular expression that matches
all of the resource names on the set of execution traces
we observed. Furthermore, our approximate infection
relations do not contain information regarding the removal
or mutation of non-system files, as it is generally not
possible to restore this state without additional information
not encoded in the malware. Of course, using an
approximate infection relation for remediation introduces
the possibility of false negatives and false positives. A
false negative occurs when the remediation fails to
properly reverse the changes left by the malware. Similarly, a
false positive occurs when remediation affects resources
that were not touched by the malware. Both types of
error are possible given the way we construct approximate
infection relations. For example, false positives may
result from the overapproximation of resource names with
regular expressions, whereas false negatives may result
from the fact that we do not account for all possible
execution paths in the malware. Thus, it is our goal to
construct an approximate infection relation that minimizes
false positives and false negatives.
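

The two error types can be expressed as set differences between the
resources the malware actually touched and the resources a remediation
procedure acted on. The sketch below is only an illustration of these
definitions; the function name and the resource names are hypothetical.

```python
def remediation_errors(touched, remediated):
    """Return (false_negatives, false_positives) as sets of resource names."""
    false_negatives = touched - remediated   # malware effects left behind
    false_positives = remediated - touched   # benign resources affected
    return false_negatives, false_positives


# Resources changed by the malware in one (made-up) infection.
touched = {
    r"c:\windows\poqwz.exe",
    r"...\CurrentVersion\Run\vq",
    r"c:\windows\system32\drivers\etc\hosts",
}
# Resources an overly broad, incomplete procedure acted on.
remediated = {
    r"c:\windows\poqwz.exe",
    r"...\CurrentVersion\Run\vq",
    r"c:\windows\notepad.exe",
}
fn, fp = remediation_errors(touched, remediated)
```

Here the untouched hosts file is a false negative and the removed
notepad.exe is a false positive.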


4.2 System Details 


This section details the specific algorithms and subsys- 
tems used in the three main components of our system 
(depicted in Figure 2). 


4.2.1 High-Level Behavior Extraction 


Intuitively, the problem of high-level behavior extraction
is to derive a concise description of the behavior
semantics demonstrated by a malware sample. Given a
malware sample m and a set D of high-level behavior
templates that describe events related to system state
modification, the goal of this task is to produce a sequence of
instances of the members of D, along with a corresponding
low-level description of system events that match each
template instance.

The set of behavior templates used in our prototype
is given in Table 1. To infer high-level behaviors from
a stream of system calls, we use multilayer behavior
specifications, as proposed in previous work [11].
Although the details of the inference algorithm are beyond
the scope of this paper, we give a brief account of the
main points here. Each high-level behavior is described
in terms of a hierarchical model. Each level of the hierar- 
chy is composed of a set of behavior summaries and their 
accompanying behavior graphs. The graph for a given 
behavior summary encodes the behavior operationally, in 
terms of events and the dependencies among them. The 
events in a graph at a particular level are defined in terms 
of the summaries of levels lower in the hierarchy. The 
top level of the hierarchy corresponds to the final output 
of the inference, and the layers beneath it provide de- 
tails of incremental specificity, until the lowest level is 
reached. In our prototype, the lowest level corresponds 
to a system call trace collected in a virtual environment. 
We use a modified version of QEMU [3] to monitor an 
application for its system call trace. 

The nodes in the behavior graphs at each layer cor- 
respond to events that are observed by the monitor, and 
the edges correspond to data dependencies between the 
events. For example, in the graphs at the lowest level, 
system calls that operate on the same resource handle 
have edges between their representative nodes that re- 
flect this dependency. At the highest level, this relation- 
ship is preserved by edges that denote the fact that the 
corresponding set of high-level behaviors operate on the 
same file. Representing high-level behavior graphs
hierarchically has one crucial advantage: the same high-level
behavior can be described in terms of multiple alterna- 
tive intermediate behaviors. For example, our high-level 
behavior DropAndAutostart can be represented in terms 
of all possible low-level system call sequences that cre- 
ate a new file, write executable content into it, and then 
change the system configuration to activate the dropped 
file at boot time. Because there are numerous distinct 
ways to accomplish this high-level task in terms of sys- 
tem calls, it is important to account for all of them in a 
clean and straightforward way. Our hierarchical behavior
model formalism allows this, and thus makes our system 
more resilient to this type of evasion. 

Figure 3 shows a sample system call trace and two 
of the high-level behaviors extracted from it. The fig- 
ure shows both the concrete graphs and the template 
instances that were matched. The first four system
calls in the trace (members $s_1$ through $s_4$) are executed
by the malware sample to replicate its payload
into a new file. These calls are associated with the
layer-1 behavior FileCreation. Similarly, the system
calls $s_{11}$, $s_{13}$, and $s_{14}$ are associated with the layer-1
behavior RegistryCreation, and the last system call
($s_{41}$) with the behavior FileDeletion. Since the behaviors
FileCreation and RegistryCreation are related, the algorithm
infers the high-level layer-2 behavior DropAndAutostart,
which represents the fact that the malware replicates and
configures the system to execute the malicious payload at
boot. Note that this high-level behavior was inferred
hierarchically; the fact that DropAndAutostart is present
in the trace was inferred
only from layer-1 behaviors, which were in turn inferred 
from system calls originally found in the trace. By 
modularizing the template definitions in this way, our 
high-level behavior inference technique gains a certain 
amount of resilience to obfuscations and differences in 
malware implementation [11]. 
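The layered inference described above can be illustrated with a minimal Python sketch. It is not the inference algorithm of [11]: the `Call` record, the two matching functions, and the rule that a registry value naming a created file implies DropAndAutostart are simplifying assumptions made only for illustration. Real behavior graphs track data dependencies rather than just a shared handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One system call event (hypothetical, simplified record)."""
    name: str      # system call name, e.g. "NtCreateFile"
    handle: str    # resource handle the call operates on ("" if none)
    arg: str = ""  # resource name or written value, if any


def layer1(trace):
    """Group calls by handle and summarize each group as a layer-1 behavior."""
    groups = {}
    for call in trace:
        groups.setdefault(call.handle, []).append(call)
    behaviors = []
    for calls in groups.values():
        names = [c.name for c in calls]
        if names[0] == "NtCreateFile" and "NtWriteFile" in names:
            behaviors.append(("FileCreation", calls[0].arg))
        elif names[0] == "NtOpenKey" and "NtSetValueKey" in names:
            value = next(c.arg for c in calls if c.name == "NtSetValueKey")
            behaviors.append(("RegistryCreation", calls[0].arg, value))
        elif names == ["NtDeleteFile"]:
            behaviors.append(("FileDeletion", calls[0].arg))
    return behaviors


def layer2(behaviors):
    """Infer DropAndAutostart when a registry value names a created file."""
    created = {b[1] for b in behaviors if b[0] == "FileCreation"}
    return [("DropAndAutostart", b[2], b[1])
            for b in behaviors
            if b[0] == "RegistryCreation" and b[2] in created]


# A shortened version of the sample trace discussed in the text.
trace = [
    Call("NtCreateFile", "f", "poqwz.exe"),
    Call("NtWriteFile", "f"),
    Call("NtClose", "f"),
    Call("NtOpenKey", "r", "Run"),
    Call("NtSetValueKey", "r", "poqwz.exe"),
    Call("NtClose", "r"),
    Call("NtDeleteFile", "", "c:\\malware.exe"),
]
print(layer2(layer1(trace)))
```

Because layer 2 consumes only layer-1 summaries, a different system call sequence that still yields FileCreation and RegistryCreation would produce the same high-level result.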


4.2.2 Behavior Clustering 


Given a set of high-level behavior traces $\{B^1, \ldots, B^n\}$
corresponding to multiple executions of the same malware
sample, behavior clustering identifies elements of
distinct traces that correspond to the same malicious
activity. An admissible clustering for a given set of traces is
a set of behavior sets $\{C_1, C_2, \ldots, C_k\}$ that satisfies two
conditions:


1. All behaviors in a given cluster $C_i$ have the same
type. For example, all behaviors are of type 
DropAndAutostart. 


2. The clustering partitions the set of all events in ev- 
ery execution trace: no behavior is in more than one 
cluster, and each behavior is in some cluster. 
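The two admissibility conditions can be checked mechanically. The following Python sketch assumes a hypothetical encoding in which each trace is a list of behavior type names and each behavior is identified by a (trace id, position) pair; neither encoding comes from the system itself.

```python
def is_admissible(clusters, traces):
    """Check the two admissibility conditions.

    clusters: list of lists of (trace_id, index) pairs.
    traces:   dict mapping trace_id to a list of behavior type names.
    """
    seen = []
    for cluster in clusters:
        types = {traces[t][j] for (t, j) in cluster}
        if len(types) != 1:  # condition 1: one type per cluster
            return False
        seen.extend(cluster)
    everything = {(t, j)
                  for t, behaviors in traces.items()
                  for j in range(len(behaviors))}
    # Condition 2: no behavior appears twice, and every behavior appears.
    return len(seen) == len(set(seen)) and set(seen) == everything


traces = {
    1: ["DropAndAutostart", "FileDeletion", "FileInfection"],
    2: ["DropAndAutostart", "FileDeletion", "FileInfection"],
}
good = [[(1, 0), (2, 0)], [(1, 1), (2, 1)], [(1, 2), (2, 2)]]
mixed = [[(1, 0), (2, 1)], [(1, 1), (2, 0)], [(1, 2), (2, 2)]]
print(is_admissible(good, traces), is_admissible(mixed, traces))
```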
$s_1$     NtCreateFile("poqwz.exe") → f
$s_2$     NtWriteFile(f, "...malicious code...")
$s_3$     NtWriteFile(f, "...other malicious code...")
$s_4$     NtClose(f)

$s_{11}$  NtOpenKey("Run") → r
$s_{12}$  NtQueryValueKey(r, "vq") → FAILURE
$s_{13}$  NtSetValueKey(r, "vq", "poqwz.exe")
$s_{14}$  NtClose(r)

$s_{21}$  NtOpenFile("...\system32\user32.dll") → g
$s_{22}$  NtWriteFile(g, "...malicious data...")
$s_{23}$  NtClose(g)

$s_{31}$  NtOpenFile("c:\windows\hosts") → h
$s_{32}$  NtReadFile(h, 1024) → "# Copyright (c)..."
$s_{33}$  NtWriteFile(h, "67.42.10.3 www.google.com...")
$s_{34}$  NtWriteFile(h, "67.42.10.3 www.citibank.com...")
$s_{35}$  NtClose(h)

$s_{41}$  NtDeleteFile("c:\malware.exe")

(a)

[Behavior graph: DropAndAutostart above FileCreation and RegistryCreation; FileDeletion shown separately.]

High-Level Behavior Summaries

DropAndAutostart("poqwz.exe", data, "Run", "vq", "poqwz.exe")
FileCreation("poqwz.exe", data)
FileDeletion("c:\malware.exe")
RegistryCreation("Run", "vq", "poqwz.exe")

(b)

Figure 3: The system call trace for our sample malware.exe (a) and high-level behaviors generated from the trace (b).


In later stages, the system generalizes behaviors
in the same cluster by overapproximating their argument
values. Thus, desirable clusterings are those that lead to
tighter overapproximations, while still grouping related
behaviors together in order to allow generalization. As
an example, Figure 4 shows two high-level traces of our
sample malicious program. We denote the $j$-th behavior
observed in the $i$-th execution trace as $b^i_j$. For these
traces, we want to group behaviors $b^1_1$ and $b^2_1$ because
they correspond to the same activity, and generalizing their
arguments leads to a tight overapproximation: we can use
regular expressions that match a fairly small set of strings
(namely, po[[:alpha:]]{3}.exe). Similarly, we want to
group $b^1_2$ with $b^2_2$ and $b^1_3$ with $b^2_3$. However, had the
second trace contained another DropAndAutostart behavior
for an executable named avkiller.exe, then clustering
$b^1_1$ with this behavior would have resulted in a poor
generalization. An optimal clustering is one that includes
all related high-level behaviors so that generalization will
create a powerful regular expression that finds all traces
of a malicious behavior. On the other hand, an optimal
clustering must not include unrelated high-level behaviors,
as a generalization of such a cluster is likely to
match benign system resources.


Cluster Formation: Exhaustively searching for the
optimal clustering of $\{B^1, \ldots, B^n\}$ is infeasible, as
there are an exponential number of possibilities. Thus,
we do not attempt to find an optimal clustering and
instead rely on the heuristic method shown in Algorithm 1.
The algorithm begins by finding the execution trace with
the greatest number of high-level behaviors, $B^{max}$, and
creating an initial clustering by placing each $b^{max}_i$ in
its own cluster $C_i$. Then, for each remaining behavior
trace $B^i$, the events are enumerated in execution order
and added to the first cluster that satisfies the
admissibility criterion discussed above. We discuss the details
of matching event types below. If an event cannot be added
to any existing cluster, then a new cluster is initialized
with the current event. This process is repeated until no
traces remain, at which point the current set of clusters is
returned as the final result.


Intuitively, the heuristics in this algorithm rely on two
assumptions: (1) distinct executions of the malware
exhibit similar malicious behaviors, and (2) the ordering of
malicious behaviors between executions is similar. By 
selecting the trace with the greatest number of events to 
seed the clustering process and assuming that different 
executions contain a similar set of behaviors, we seek 
clusterings that group as many behaviors together as pos- 
sible. By adding events to existing clusters in execution 
order and assuming that the order does not vary substan- 
tially between executions, we seek clusterings that match 
similar argument values, thus resulting in tighter over- 
approximations in the behavior generalization phase of 
the system. Furthermore, these heuristics allow our
algorithm to operate efficiently: Algorithm 1 runs in time
linear in the number of execution traces and the length of 
the traces. 
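A minimal Python sketch of this heuristic follows. It implements the prose description (each behavior joins the first matching cluster) and compares against one representative per cluster. The predicate `same_behavior` is a hypothetical stand-in for the behavior comparison used by the system, and behaviors are modeled as plain tuples for illustration only.

```python
def cluster(traces, same_behavior):
    """Greedy clustering seeded with the longest trace.

    traces:        list of traces, each a list of behaviors.
    same_behavior: predicate deciding whether two behaviors match
                   (assumed here; the system compares behavior graphs).
    """
    if not traces:
        return []
    # Seed with the first trace of maximal length.
    seed_idx = max(range(len(traces)), key=lambda i: len(traces[i]))
    clusters = [[b] for b in traces[seed_idx]]
    for i, trace in enumerate(traces):  # traces in collection order
        if i == seed_idx:
            continue
        for b in trace:                 # behaviors in execution order
            for c in clusters:
                if same_behavior(b, c[0]):
                    c.append(b)         # join the first matching cluster
                    break
            else:
                clusters.append([b])    # no match: start a new cluster
    return clusters


# Behaviors as (type, first argument) tuples; matching by type only.
b1 = [("DropAndAutostart", "poqwz.exe"), ("FileDeletion", "malware.exe"),
      ("FileInfection", "hosts")]
b2 = [("DropAndAutostart", "pobxz.exe"), ("FileDeletion", "malware.exe"),
      ("FileInfection", "hosts")]
result = cluster([b1, b2], lambda x, y: x[0] == y[0])
print(result)
```

Matching by type alone is the weakest admissible predicate. A stricter predicate yields more, smaller clusters and therefore tighter generalizations.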


For an example of how Algorithm 1 works, consider
the two high-level execution traces depicted in Figure 4.
As both traces are of equal length, the first is chosen,
in this case $B^1$. Clusters $C_1$, $C_2$, and $C_3$ are
initialized with behaviors $b^1_1$, $b^1_2$, and $b^1_3$, respectively.
$b^1_1$ and $b^2_1$ can then be matched, as they are both
instances of the DropAndAutostart high-level behavior.
Similarly, $b^1_2$ is matched to $b^2_2$, and $b^1_3$ is matched to
$b^2_3$. Finally, the algorithm returns clusters $\{C_1, C_2, C_3\}$
where $C_1 = \{b^1_1, b^2_1\}$ represents DropAndAutostart
behaviors, $C_2 = \{b^1_2, b^2_2\}$ represents FileDeletion
behaviors, and $C_3 = \{b^1_3, b^2_3\}$ represents FileInfection
behaviors.

$b^1_1$: DropAndAutostart("c:\...\poqwz.exe", data, "...\Run", "vq", "poqwz.exe")
$b^1_2$: FileDeletion("c:\malware.exe")
$b^1_3$: FileInfection("...\etc\hosts", "67.42...", data)

$b^2_1$: DropAndAutostart("c:\...\pobxz.exe", data, "...\Run", "vq", "pobxz.exe")
$b^2_2$: FileDeletion("c:\malware.exe")
$b^2_3$: FileInfection("...\etc\hosts", "67.42...", data)

Figure 4: High-level behavior clustering.


Behavior Comparison: Our clustering algorithm re- 
quires a sub-algorithm, isomorphic, to compare two be- 
haviors. Intuitively, we perform this comparison by nor- 
malizing the graphs corresponding to each behavior and 
then checking whether the resulting normalized graphs 
are isomorphic. There is an important advantage in com- 
paring the behavior graphs rather than their high-level 
summaries: nondeterminism in a malicious program typ- 
ically affects the summary of the behavior, but not the 
low-level operations used to achieve the behavior. There- 
fore, this approach is more resilient to nondeterminism 
and performs a more thorough comparison, eventually 
yielding more precise results. 

The normalization we perform on each graph mainly 
consists of abstracting away details of the behavior that 
are likely affected by nondeterminism. System call ar- 
guments that represent resource names are replaced by 
constants that denote their type. For example, we use 
a different constant for each file and registry type. Se- 
quences of system calls that operate sequentially on the 
same resource are replaced with a single, batch call that 
is semantically identical. Finally, we ignore system calls 
whose effects are later killed, i.e., overwritten or otherwise
reversed. In this way, our normalization step produces
more succinct graph representations of the malware’s
behavior that are largely independent of common
forms of nondeterminism. 
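To make the first two normalization steps concrete, the sketch below works on a linear sequence of calls rather than on a graph, which is a deliberate simplification. The `(syscall, kind, resource)` tuples and the `KIND_CONSTANT` table are assumptions for illustration, and the removal of killed effects is omitted.

```python
# Hypothetical type constants that replace concrete resource names.
KIND_CONSTANT = {"file": "FILE", "registry": "REGISTRY"}


def normalize(calls):
    """Normalize a sequence of (syscall, kind, resource) tuples.

    1. Replace each resource name by a constant denoting its type.
    2. Collapse runs of the same call on the same resource into one
       batch call.
    """
    out = []
    prev = None
    for name, kind, resource in calls:
        if prev == (name, resource):
            continue  # same operation repeated on the same resource
        out.append((name, KIND_CONSTANT[kind]))
        prev = (name, resource)
    return out


run_a = [("NtCreateFile", "file", "poqwz.exe"),
         ("NtWriteFile", "file", "poqwz.exe"),
         ("NtWriteFile", "file", "poqwz.exe"),
         ("NtClose", "file", "poqwz.exe")]
run_b = [("NtCreateFile", "file", "pobxz.exe"),
         ("NtWriteFile", "file", "pobxz.exe"),
         ("NtClose", "file", "pobxz.exe")]
print(normalize(run_a) == normalize(run_b))
```

The two runs differ in file name and in the number of writes, yet normalize to the same sequence, which is the property the comparison step relies on.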

After normalizing two graphs for comparison, we use 
the VFlib2 graph isomorphism algorithm [6]. Although 
isomorphism is a difficult problem and may be inefficient 
to compute on large graphs, we point out that the normal- 
ized behavior graphs resulting from real-world programs 
are typically quite small, comprising no more than a few 
dozen nodes. 


4.2.3 Behavior Generalization


After clustering, we have several sets of behaviors 
grouped by semantic similarity but still differing in cer- 
tain details. For example, when we build clusters we 
group together behaviors that differ in the specific re- 
sources they identify. The goal of behavior generaliza- 
tion is to produce a single canonical behavior that rep- 
resents all of the members of a given cluster, as well as 
variations of the members that are likely to result from 
other executions of the malware. In terms of the def- 
initions presented in Section 4.1, behavior generaliza- 
tion produces high-level behaviors with arguments con- 
structed to accurately represent the resources modified 
by observed executions, while generalizing to potential 
future executions. 

Algorithm 1 Cluster($\mathcal{B}$, $B^{max}$)

Require: $\mathcal{B}$ is a set of high-level behavior traces
    $\{B^1, B^2, \ldots, B^n\}$;
    $B^{max}$ is the high-level behavior trace containing the
    maximum number of high-level behaviors
Result: A set of clusters of high-level behaviors of
    $\{B^1, B^2, \ldots, B^n\}$

$C \leftarrow \emptyset$
for $b^{max}_i \in B^{max}$ do
    add new cluster $\{b^{max}_i\}$ to $C$
end for
for all $B^i \in \mathcal{B} \setminus B^{max}$ do
    {Traces are enumerated in the order of collection.}
    for all $b^i_j \in B^i$ do
        {Behaviors are enumerated in execution order.}
        for all $C_k \in C$ do
            if isomorphic($b^i_j$, $b_l$) where $b_l$ is a behavior in $C_k$ then
                $C_k \leftarrow C_k \cup \{b^i_j\}$
            end if
        end for
        if $b^i_j$ is not in any cluster then
            add new cluster $\{b^i_j\}$ to $C$
        end if
    end for
end for
return ($C$)

Algorithm 2 presents Generalize, our procedure for
generalizing a behavior cluster. Intuitively, generalization
is performed on each high-level behavior argument
individually, and the individual results are eventually
combined to produce the generalized behavior. Because
each cluster member represents the same high-level
behavior, and therefore has the same number of arguments
as the others, we are assured that all of the relevant
information is included in the generalization. Furthermore,
because all arguments for the behaviors that we are
interested in have straightforward canonical representations
as strings, the problem of generalizing each argument
can be reduced to the problem of generalizing sets of
strings. Generalize proceeds in this vein, iterating over
each argument for the behaviors in a given cluster $C$. After
collecting each string for a given argument in a set $A_i$,
a probabilistic finite-state automaton (PFSA) that accepts
all of the strings in $A_i$ is constructed using the simulated
beam annealing algorithm [17]. By merging states that
are probabilistically very similar, the resulting automaton
accepts a superset of $A_i$, thus resulting in an initial
generalization.

After building the PFSA, certain regions of the state
transition diagram are examined for reduction using a set
$G$ of generalization rules, which are templates for
generating regular expressions that overapproximate
high-level behavior arguments. A single-entry single-exit
region is one whose entry is a node $n_1$ that is the
immediate dominator of the exit node $n_2$, which is in turn
the immediate postdominator of $n_1$. Furthermore, we
require that the number of paths between $n_1$ and $n_2$ be at
least $\delta$. The actual value of $\delta$ is estimated
empirically. This information is represented in
Algorithm 2 with the relations idom and ipdom, as
well as the function numpaths. When a suitable single-entry
single-exit region is found, each rule in $G$ is applied
in an attempt to generalize it. The generalization rules
that we use have been chosen on the basis of experience
and consider information such as the number of paths in
the region, the probabilities associated with the paths, the
lengths of the paths, and the characters composing the
strings associated with each path. If a rule is able to
generalize the region, then it returns a smaller set of edges
that are used to replace the original region. Otherwise,
the rule returns the original region, and the next rule is
applied. After all rules in $G$ have been applied, a regular
expression is built from the resulting PFSA, which
is eventually used as an argument in the final generalized
behavior. The final behavior is represented in Algorithm 2
by $name(C_0)(G_0, \ldots, G_n)$. Here, $name(C_0)$
returns the behavior name of the high-level behavior $C_0$,
which is used to build the final generalized behavior from
the individual argument generalizations.

Algorithm 2 Generalize($C$, $G$, $\delta$)

Require: $C$ is a cluster of behaviors that differ only in
    argument values, $G$ is a set of generalization rules,
    $\delta$ is the density threshold.
Result: A generalized high-level behavior.

{Loop through all arguments for behaviors in cluster $C$}
for $i = 0$ to $|args(C_0)|$ do
    $A_i \leftarrow \emptyset$
    {Gather all values for current argument}
    for $c$ in $C$ do
        $A_i \leftarrow A_i \cup args_i(c)$
    end for
    {Generate PFSA that captures argument values}
    $(V, E) \leftarrow$ PFSA($A_i$)
    {Find dense regions in the PFSA}
    for $(n_1, n_2)$ in $V \times V - \{(n, n) \mid n \in V\}$ do
        if $\neg idom(n_1, n_2)$ or $\neg ipdom(n_2, n_1)$ or
                $numpaths(n_1, n_2) < \delta$ then
            continue
        end if
        for $r$ in $G$ do
            $E' \leftarrow r(paths(n_1, n_2))$
            $E \leftarrow (E - paths(n_1, n_2)) \cup E'$
        end for
    end for
    {Build regular expression for the current arguments}
    $G_i \leftarrow regexp(E)$
end for
{Return new behavior with type matching $C$, and
 generalized reg. exp. arguments}
return $name(C_0)(G_0, \ldots, G_n)$, $0 \le n < |args(C_0)|$

1 DropAndAutostart("c:\windows\...poagp.exe", data, "...Windows\CurrentVersion\Run", "vq", "poagp.exe")
2 DropAndAutostart("c:\windows\...pobxz.exe", data, "...Windows\CurrentVersion\Run", "vq", "pobxz.exe")
3 DropAndAutostart("c:\windows\...pocra.exe", data, "...Windows\CurrentVersion\Run", "qv", "pocra.exe")
4 DropAndAutostart("c:\windows\...pomfg.exe", data, "...Windows\CurrentVersion\Run", "vq", "pomfg.exe")
5 DropAndAutostart("c:\windows\...pommp.exe", data, "...Windows\CurrentVersion\Run", "qv", "pommp.exe")
6 DropAndAutostart("c:\windows\...popwz.exe", data, "...Windows\CurrentVersion\Run", "qv", "popwz.exe")
7 DropAndAutostart("c:\windows\...pouwk.exe", data, "...Windows\CurrentVersion\Run", "vq", "pouwk.exe")

Figure 5: Sample cluster grouping seven different occurrences of the DropAndAutostart behavior manifested by
our sample malware (the corresponding graphs are omitted for conciseness).

As an illustration of this algorithm, consider the clus- 
ter presented in Figure 5. We apply the PFSA algorithm 
to the first argument to arrive at the minimal automaton 
shown in Figure 6. The automaton contains a single- 
entry single-exit region with several paths, as highlighted 
in the figure, that encodes the variable substring of the 
filename. One of the generalization rules that we use 
is triggered by the fact that this region is dense, i.e., it
contains many paths from entry to exit, as well as the
fact that it contains only alphabetic characters. Thus, it
returns a single edge labeled [[:alpha:]]{3}, which is a
wildcard sequence that denotes all alphabetic strings of
length three. The generalized PFSA results in the regular
expression c:\windows\po[[:alpha:]]{3}.exe,
which is capable of identifying all the names of
the files that our sample malicious program could
touch on the system. After applying Generalize to
all arguments of DropAndAutostart, we obtain a
generic model of the cluster behavior represented by
DropAndAutostart(“c:\windows\po[[:alpha:]]{3}.exe”, data,
“...Windows\CurrentVersion\Run”, “(vq|qv)”).
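The effect of the density rule can be reproduced with a much simpler stand-in for the PFSA construction. The Python sketch below is not the simulated beam annealing algorithm of [17]: it factors out the common prefix and suffix of the observed strings and treats the varying middle as the region to generalize. The function name `generalize` and the default threshold `delta=5` are assumptions. Python's `re` module does not support the POSIX class `[[:alpha:]]`, so the sketch emits `[A-Za-z]` instead.

```python
import os
import re


def generalize(strings, delta=5):
    """Generalize a set of strings into one regular expression.

    If at least `delta` distinct alphabetic middles of equal length are
    observed, the middle becomes a character class; otherwise it stays
    an explicit alternation of the observed values.
    """
    values = sorted(set(strings))
    if len(values) == 1:
        return re.escape(values[0])
    prefix = os.path.commonprefix(values)
    rest = [v[len(prefix):] for v in values]
    suffix = os.path.commonprefix([r[::-1] for r in rest])[::-1]
    middles = [r[:len(r) - len(suffix)] for r in rest]
    lengths = {len(m) for m in middles}
    dense = len(middles) >= delta
    alphabetic = all(m.isascii() and m.isalpha() for m in middles)
    if dense and alphabetic and len(lengths) == 1:
        middle = "[A-Za-z]{%d}" % lengths.pop()
    else:
        middle = "(" + "|".join(re.escape(m) for m in middles) + ")"
    return re.escape(prefix) + middle + re.escape(suffix)


names = ["poagp.exe", "pobxz.exe", "pocra.exe", "pomfg.exe",
         "pommp.exe", "popwz.exe", "pouwk.exe"]
pattern = generalize(names)
print(pattern, generalize(["vq", "qv", "vq"]))
```

With seven observed file names the middle is dense enough to be replaced by a character class, while the two observed registry value names stay an explicit alternation.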


4.2.4 Generating Concrete Remediation Procedures 


Each generalized high-level behavior must be remediated 
differently. Our approach to generating executable re- 
mediation procedures may be understood conceptually 
in two parts. First, the generalized high-level behaviors 
for each cluster are used to construct an approximate
infection relation $R$ as discussed in Section 4.1. Then, we
use a generic procedure that scans the infection relation, 
and changes the state of the system based on the contents 
of each entry. When constructing the infection relation, 
our procedure uses a model of a clean, bare installation 
of the operating system installed on the machine for the 
first system state component of each tuple. The use of 
a bare installation enables us to remediate infected sys- 
tem resources up to the correct service pack installed on 
the system, but not personal or application-specific re- 
sources. 

The remainder of this section details the way that
specific high-level behaviors are translated into entries in the
abstract infection relation, as well as the way that the
abstract infection relation is used to generate a concrete
(executable) remediation procedure.

Algorithm 3 Remediate($S$, $R$)

Select the tuple in $R$ whose clean system state has the same
operating system version as $S$; let $S_{add}$ and $S_{inf}$ be its
sets of newly created and infected resources.
for $s$ in $S_{add}$ do
    cases $s$:
        (name, data): if file name exists, with contents
            matching data, then remove name.
        ((key, value), data): if (key, value) exists with
            contents matching data, then remove (key, value).
        ((file, key, value), (data, regdata)):
            if file exists and is a suffix of some element of
            $D_{regdata}$ that also exists in a key matching
            (key, value), then remove file and (key, value).
        ((file, procname), data):
            if procname and (file, data) exist matching
            file, data, procname and procname is a
            suffix of file, then remove file and
            kill procname.
    end cases
end for
for $i$ in $S_{inf}$ do
    cases $i$:
        (file, data): Remove (file, data) and replace it
            with (file, data') $\in B(S)$.
        ((key, value), data): Remove ((key, value), data)
            and replace it with
            ((key, value), data') $\in B(S)$.
    end cases
end for


Newly-Created Resources: Remediating resources 
that are created by malware is straightforward, because 
the remediation procedure only needs information re- 
garding the names and data of newly-created resources 
to completely remove the corresponding resources from 
the system. Our remediation procedures are capable of 
removing files and registry keys. To account for the pos- 
sibility that the infection could create resources that were 
not observed in a high-level behavior trace during anal- 
ysis, we instead use generalized high-level behaviors in 
the infection relation R. 
[Automaton diagram; the highlighted portion is labeled "Single-entry-single-exit region".]

Figure 6: A fragment of the minimized automaton constructed to generalize the first argument of the
DropAndAutostart behavior, starting from the occurrences of the argument reported in Figure 5.


For the high-level file creation behavior
FileCreation(name, data), we find the resource
for name and data and append this pair to $S_{add}$.
Similarly, for the high-level registry creation behavior
RegistryCreation(key, value, data), we associate
the key/value pair to the corresponding data and add
them as a pair to $S_{add}$. As shown in Algorithm 3, the
remediation procedure processes these entries in the
infection relation $R$ by checking for the existence of the
resource names on the system and removing them if they
exist with the contents specified by $R$.

Remediating the DropAndAutostart and 
DropAndExecute behaviors is more complicated, 
as doing so involves multiple resources that are related 
in a constrained manner. To handle a high-level behavior 
of the form: 


DropAndAutostart(file, data, key, value, regdata) 


we group the resource names file, key, and value together as
a compound resource name for a new element in $S_{add}$,
and group data and regdata together for the corresponding
data component. The remediation procedure acts on
such an entry by scanning system resources for names 
that match the file name and registry key/value pairs. If a 
match is found, the corresponding resources are removed 
only if the concrete filename is a suffix of the concrete 
registry data and the concrete data matches the abstract 
data. 
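The constrained match can be sketched in Python as follows. The system model is hypothetical: files are a dictionary from path to contents, the registry is a dictionary from (key, value) pairs to registry data, and the function name is ours. The sketch interprets "filename" as the final path component and omits the check of file contents against the abstract data.

```python
import re


def find_drop_and_autostart(files, registry, file_re, key, value_re,
                            regdata_re):
    """Return (path, (key, value)) pairs that are safe to remove.

    A file is selected only if its path matches file_re AND its name is
    a suffix of the data stored under a matching registry key/value.
    """
    hits = []
    for (k, v), regdata in registry.items():
        if k != key or not re.fullmatch(value_re, v):
            continue
        if not re.fullmatch(regdata_re, regdata):
            continue
        for path in files:
            name = path.rsplit("\\", 1)[-1]
            if re.fullmatch(file_re, path) and regdata.endswith(name):
                hits.append((path, (k, v)))
    return hits


files = {
    "c:\\windows\\poabc.exe": b"...",
    "c:\\windows\\poxyz.exe": b"...",   # matches the name pattern only
    "c:\\windows\\notepad.exe": b"...",
}
registry = {
    ("Run", "vq"): "poabc.exe",
    ("Run", "other"): "notepad.exe",
}
hits = find_drop_and_autostart(
    files, registry,
    file_re=r"c:\\windows\\po[A-Za-z]{3}\.exe",
    key="Run", value_re=r"(vq|qv)",
    regdata_re=r"po[A-Za-z]{3}\.exe",
)
print(hits)
```

The file poxyz.exe matches the generalized name but is not referenced by any autostart entry, so it is left alone. This is how the suffix constraint limits false positives caused by overapproximated names.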

For example, when the procedure encounters the
generalized DropAndAutostart from Figure 5, it will
augment $S_{add}$ with the following resource:


(c:\windows\po[[:alpha:]]{3}\.exe,
(...\CurrentVersion, Run),
(data, po[[:alpha:]]{3}\.exe))


The remediation procedure will then search the system
for a file that matches c:\windows\po[[:alpha:]]{3}.exe,
as well as the registry key (...\CurrentVersion, Run), and
will remove the resources only if the value of the registry
key matches the name of any file that matches the regular
expression.


Infected Resources: Remediating infected resources
is more challenging than remediating newly created
resources. In general, it is not possible to know the
contents of a file before infection takes place, so it is
not possible to restore its contents to a clean state. The
exception is operating system files, which are common
to all systems and can thus be known to the remediation
procedure a priori.

A naive approach to remediating high-level 
FileInfection(name, region, data) behaviors 
would be to replace the entire file with the corresponding 
file in the bare operating system. However, uninfected 
regions of data may be removed by this technique, which 
could result in the loss of important system data, or 
leave the system in an inconsistent state. To avoid this 
circumstance, high-level behavior traces keep track of
uninfected regions (regions) in addition to the file name
(file) and the infected data (data). We update the $S_{inf}$
component of $R$ to account for a FileInfection behavior only
if there is an actual file in the clean operating system
state whose name matches file. In this case, $S_{inf}$ is
updated with the contents of file in the bare operating
system state, modified by preserving the portions listed
in regions and overwriting the rest with data. As indicated
in Algorithm 3, when the remediation procedure
finds file, it replaces the infected regions with a pristine
copy from the bare operating system.

Similarly, when a high-level
RegistryInfection((key, value), data) behavior
is encountered, and it is determined that a counterpart of
(key, value) exists in the bare operating system, $S_{inf}$ is
modified by adding the key/value pair together with the
modified data to the list of infected resources. As with
infected files, Algorithm 3 remediates these resources
by locating a pristine copy of (key, value) in the bare
operating system and replacing the infected resource
with it.


Deleted Resources: Currently, most malware is writ- 
ten with the intent of leveraging infected systems to per- 
petrate profitable, albeit illicit, activities. Therefore, it 
is very rare to see malware removing system resources, 
as doing so would render the system useless for money- 
making activities. For this reason, our remediation pro- 
cedures do not handle deleted resources. 


5 Evaluation 


We applied our remediation procedure generation algo- 
rithm to over two hundred malware samples collected in 
the wild. We evaluated the quality of the generated pro- 
cedures with respect to two metrics: false positives and 
false negatives. A false positive occurs when a resource 
is mistakenly identified as being part of a malware in- 
fection and subsequently remediated. A false negative 
occurs when a resource that was actually involved in an 
infection is not identified and left untouched by the re- 
mediation procedure. The results of our evaluation tes- 
tify to the effectiveness of our technique: we observed a 
low false negative rate, with more than 98% of the ma- 
licious resources successfully remediated, and only one 
false positive was encountered. Finally, we compare our 
results to the remediation capabilities of the three com- 
mercial products that performed best in previous experi- 
ments [15]. 


5.1 Experimental Setup 


Our experiments were performed over a corpus of 200 
malicious programs, obtained through our own honey- 
pot, and a web crawler that crawls known malicious do- 
mains for executable files. Several traces for each sam- 
ple were collected by executing it in multiple distinct en- 
vironments. To extract a wide range of behaviors from 
each sample, we modified the environments along a va- 
riety of dimensions, including locale, timezone, and the 
set of installed applications. Specifically, for each sam- 
ple we performed the following steps: 


1. Execute the sample three times in five different en- 
vironments, collecting a system call trace for each 
execution. Apply the algorithm described in Sec- 
tion 4.2 to generate a remediation procedure from 
the collected data. 
2. Infect twenty-five test environments, all of them
distinct from those used to collect traces, with the
sample.


3. Execute the generated remediation procedure in 
each test environment. 


4. Compare the remediated state to the original (clean) 
state. Tally the false positives and false negatives. 
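The tally in step 4 amounts to comparing three snapshots of a test environment. The Python sketch below assumes a hypothetical representation in which each snapshot maps a resource name to its contents; the function name `tally` is ours.

```python
def tally(clean, infected, remediated):
    """Return (false_positives, false_negatives) for one environment.

    Each argument maps resource name -> contents for one snapshot.
    A missing key means the resource does not exist in that snapshot.
    """
    # Resources the malware created, modified, or deleted.
    touched = {r for r in clean.keys() | infected.keys()
               if clean.get(r) != infected.get(r)}
    # Resources the remediation procedure changed.
    fixed = {r for r in infected.keys() | remediated.keys()
             if infected.get(r) != remediated.get(r)}
    false_pos = fixed - touched
    false_neg = {r for r in touched if remediated.get(r) != clean.get(r)}
    return false_pos, false_neg


clean = {"hosts": "ok", "a.txt": "1"}
infected = {"hosts": "bad", "a.txt": "1", "mal.exe": "x", "run": "mal.exe"}
remediated = {"hosts": "ok", "run": "mal.exe"}
print(tally(clean, infected, remediated))
```

In this example the procedure wrongly removed a.txt (a false positive) and left the autostart entry in place (a false negative).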


Although we do not attempt to extract all possible exe- 
cution paths from the malware, this strategy allows us to 
observe a reasonable range of malware behavior in vari- 
ous settings. 


5.2 False Negatives 


Figure 7 compares the false negative rate of our 
automatically-generated remediation procedures with 
the three top-rated commercial malware detectors eval- 
uated in [15]: Nod32 Anti-Virus 3.0, Panda Anti-Virus 
9.0.5, and Kaspersky Anti-Virus 2009. The graph depicts 
the average number of malicious resources that were re- 
mediated over the entire malware corpus. Resources are 
divided into three categories: files, registry keys, and 
processes. Each of these classes is further divided into 
two subcategories: primary and ancillary. Primary re- 
sources are composed of executable files, registry keys 
that activate process creation, and processes that arise 
from files dropped or infected by the malware sample. 
Roughly, we argue that all other resources are not as crit- 
ical to the security of the system, and are thus considered 
ancillary. 

For the majority of these categories and subcategories,
our remediation procedures are more complete than
commercial anti-malware products. For example, our
procedures were able to remediate more than 99% of the
primary file resources, whereas the best commercial
product we tested reached only 82% in this subcategory.
Similarly, our procedures remediated 99% of primary
registry activities, while commercial products did not
exceed 86%. Furthermore, while ancillary objects are
often ignored by commercial remediation procedures, our
procedures remediated 95% of ancillary files and 98%
of ancillary registry activities. The portion of file and
registry resources that were not remediated by our
procedures corresponds to behaviors that were never
observed while collecting traces. This illustrates the primary
limitation of our dynamic analysis-based approach and
highlights a clear avenue for improvement in future work.
Finally, our procedures remediated 100% of primary
process resources. However, the performance on ancillary
processes is significantly lower. This is a result of the fact
that our procedures do not have access to enough
information to discern a benign process from a process spawned
by the malware using a pre-existing benign file.
Figure 7: Comparison of the completeness of our automatically generated remediation procedures with the
completeness of the procedures employed in three top-rated commercial malware detectors.


5.3 False Positives


To quantify false positives, we compared the set of re- 
sources affected by each malware sample in each test en- 
vironment with the set of resources our procedures re- 
mediated in each test environment. Any remediated re- 
source not affected by the corresponding malware sam- 
ple in at least one trace is considered a false positive. 
We found that only one of our procedures produced any 
false positives. The cause of this false positive, not sur- 
prisingly, was a high-level behavior argument specified 
by a very general regular expression. This implies that 
the nondeterminism demonstrated by the corresponding 
malware sample was too complex to be easily described 
by a regular language. Thus, one area for future work 
is utilizing more expressive language classes, such as 
context-free grammars, for generalizing argument val- 
ues. 
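
The false-positive criterion described above reduces to a set difference. The following Python sketch illustrates it; it is not the implementation used in the evaluation. The function name and the example resource paths are hypothetical, and the sketch assumes each trace has already been reduced to the set of resources the sample affected.

```python
def false_positives(remediated, affected_by_trace):
    """Return remediated resources the malware affected in no trace.

    remediated        -- iterable of resources the procedure acted on
    affected_by_trace -- iterable of sets, one per execution trace
    """
    # A resource counts as affected if it appears in at least one trace.
    affected = set()
    for trace in affected_by_trace:
        affected |= set(trace)
    return set(remediated) - affected


# Hypothetical example: two traces of one sample, and a procedure whose
# over-general regular expression also matches an unrelated temp file.
traces = [
    {r"C:\Windows\evil.exe", r"HKLM\Run\evil"},
    {r"C:\Windows\evil.exe", r"C:\Temp\a1b2.tmp"},
]
remediated = {
    r"C:\Windows\evil.exe",
    r"C:\Temp\a1b2.tmp",
    r"C:\Temp\report.tmp",
}
fp = false_positives(remediated, traces)
```

Here only the unrelated temp file would be flagged, which mirrors the single false positive traced to a very general regular expression.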


6 Discussion 


We are aware of some limitations of our system. Some of 
these limitations could be exploited by attackers to cause 
the system to produce remediation procedures that are of 
limited value. In this section, we discuss these limita- 
tions and present some solutions that we will investigate 
in the future to address the limitations. 

We constructed the models that we use to detect high- 
level behaviors by leveraging years of experience in mal-
ware analysis, and we carefully tested all models to en-
sure that they cannot be evaded. However, since we can- 
not prove that these models are perfect, we must take into 
account the possibility that attackers could find new ways 
to perform some high-level malicious activities without 
being detected. Moreover, in our proof-of-concept im-
plementation, multiple execution traces are obtained by
executing the same malware in several different operat- 
ing system configurations. If attackers introduced dan- 
gerous behaviors to their malicious programs that are not 
triggered in our monitoring environment, then the result- 
ing procedure would not be able to remediate such be- 
haviors. Clearly, one area for future work is in expanding 
the coverage of the dynamic behavioral analysis. While 
our approach covers some of the potential behavior of the 
sample, more sophisticated techniques [12, 21] can be 
applied to increase the likelihood that all relevant paths 
through the malware are explored. 


The high-level behaviors observed in multiple execu- 
tion traces are clustered to identify the instances of the 
same behavior. If the clusters we generate did not in- 
clude all the instances of the same behavior, or if they 
included instances of different behaviors, then the reme- 
diation procedures constructed by generalizing the be-
haviors associated with each cluster would be too specific
or too generic. An attacker could write malicious pro- 
grams that manifest certain behaviors to break the clus- 
tering. Similarly, the regular expressions used by our re- 
mediation procedures to identify affected resources are
generalized heuristically. Attackers could develop mali-
cious programs that affect resources in a way that induces 
us to perform very aggressive generalization (e.g. create 
files with random names anywhere in the file system) and 
thus to generate remediation procedures that remove be- 
nign files. We plan to address these problems in the fu- 
ture. One approach is to introduce a feedback loop while 
clustering behaviors and generating regular expressions 
to validate the quality of the results. This feedback loop 
would repeat the process until no further progress can be 
made. Finally, we assert that it is not possible to cause 
our algorithm to generate a procedure that modifies ex- 
isting files in a harmful way. This follows from the fact 
that system files are only ever restored to their original 
state by the procedure, not modified. 


We currently generate a remediation procedure for 
each malware sample we analyze. We plan to extend 
our system to generate remediation procedures that cover 
more than one malware sample. For example, it would be 
useful to generate remediation procedures that are capa- 
ble of operating on all samples for a given malware fam- 
ily. Because the generated procedures will likely have 
to account for a much higher degree of nondeterminism 
than those that target only a single sample, additional 
care must be taken to ensure that the high-level behav-
ior models are not too general, which would result in false
positives. 


7 Conclusion 


In this paper, we have presented a technique for auto- 
matically generating malware remediation procedures. 
Given a malware binary, our system produces executable 
code that removes the harmful effects of executing that 
malware on a system. We use dynamic analysis and be- 
havior generalization to account for the difficulties posed 
by real malware, thus allowing our procedures to effec- 
tively remediate many possible executions of the mal- 
ware without witnessing the actual infection take place. 
This contribution represents a major break with previ- 
ous automatic remediation techniques, which required 
detailed information about the particular infection being 
targeted. We implemented our technique and evaluated 
its effectiveness on more than 200 malware binaries. The 
performance of our prototype is quite good: on average, 
98% of the harmful effects are remediated, and we en- 
countered only a single false positive. In the future, we 
plan to build on this work by extending it to work on 
entire families, as well as exploring more precise tech- 
niques for generalizing observed malware behaviors.


Abstract


Reverse Turing tests, or CAPTCHAs, have become a
ubiquitous defense used to protect open Web resources 
from being exploited at scale. An effective CAPTCHA 
resists existing mechanistic software solving, yet can 
be solved with high probability by a human being. In 
response, a robust solving ecosystem has emerged, re- 
selling both automated solving technology and real- 
time human labor to bypass these protections. Thus, 
CAPTCHAs can increasingly be understood and evaluated
in purely economic terms: the market price of a solution
vs. the monetizable value of the asset being protected. We
examine the market-side of this question in depth, ana- 
lyzing the behavior and dynamics of CAPTCHA-solving 
service providers, their price performance, and the un- 
derlying labor markets driving this economy. 


1 Introduction 


Questions of Internet security frequently reflect under- 
lying economic forces that create both opportunities and 
incentives for exploitation. For example, much of today’s 
Internet economy revolves around advertising revenue, 
and consequently, a vast array of services—including e- 
mail, social networking, blogging—are now available to 
new users on a basis that is both free and largely anony- 
mous. The implicit compact underlying this model is that 
the users of these services are individuals and thus are 
effectively “paying” for services indirectly through their 
unique exposure to ad content. Unsurprisingly, attack- 
ers have sought to exploit this same freedom and acquire 
large numbers of resources under singular control, which 
can in turn be monetized (e.g., via thousands of free Web 
mail accounts for sourcing spam e-mail messages). 
CAPTCHAs were developed as a means to limit the
ability of attackers to scale their activities using auto-
mated means. In its most common implementation, a
CAPTCHA consists of a visual challenge in the form of
alphanumeric characters that are distorted in such a way
that available computer vision algorithms have difficulty 
segmenting and recognizing the text. At the same time, 
humans, with some effort, have the ability to decipher 
the text and thus respond to the challenge correctly. To- 
day, CAPTCHAs of various kinds are ubiquitously de- 
ployed for guarding account registration, comment post- 
ing, and so on. 

This innovation has, in turn, attached value to the 
problem of solving CAPTCHAs and created an indus- 
trial market. Such commercial CAPTCHA solving comes 
in two varieties: automated solving and human labor. 
The first approach defines a technical arms race between 
those developing solving algorithms and those who de- 
velop ever more obfuscated CAPTCHA challenges in re- 
sponse. However, unlike similar arms races that revolve 
around spam or malware, we will argue that the underly- 
ing cost structure favors the defender, and consequently, 
the conscientious defender has largely won the war. 

The second approach has been transformative, since 
the use of human labor to solve CAPTCHAs effectively 
side-steps their design point. Moreover, the combination 
of cheap Internet access and the commodity nature of 
today’s CAPTCHAs has globalized the solving market; 
in fact, wholesale cost has dropped rapidly as providers 
have recruited workers from the lowest cost labor mar- 
kets. Today, there are many service providers that can 
solve large numbers of CAPTCHAs via on-demand ser-
vices with retail prices as low as $1 per thousand.

In either case, we argue that the security of CAPTCHAs
can now be considered in an economic light. This prop- 
erty pits the underlying cost of CAPTCHA solving, ei- 
ther in amortized development time for software solvers 
or piecemeal in the global labor market, against the
value of the asset it protects. While the very existence of
CAPTCHA-solving services tells us that the value of the
associated assets (e.g., an e-mail account) is worth more 
to some attackers than the cost of solving the CAPTCHA, 
the overall shape of the market is poorly understood. Ab-
sent this understanding, it is difficult to reason about the
security value that CAPTCHAs offer us.

Figure 1: Examples of CAPTCHAs from various Internet properties.

This paper investigates this issue in depth and, where 
possible, on an empirical basis. We document the commer-
cial evolution of automated solving tools (particularly via
the successful Xrumer forum spamming package) and 
how they have been largely eclipsed by the emergence 
of the human-based CAPTCHA-solving market. To char- 
acterize this latter development, our approach is to en- 
gage the retail CAPTCHA-solving market on both the sup- 
ply side and the demand side, as both a client and as 
“workers for hire.” In addition to these empirical mea- 
surements, we also interviewed the owner and operator 
of a successful CAPTCHA-solving service (MR. E), who 
has provided us both validation and insight into the less 
visible aspects of the underlying business processes.¹ In
the course of our analysis, we attempt to address key
questions such as which CAPTCHAs are most heavily tar-
geted, the rough solving capacity of the market leaders,
the relationship of service quality to price, the impact 
of market transparency and arbitrage, the demographics 
of the underlying workforce and the adaptability of ser- 
vice offerings to changes in CAPTCHA content. We be- 
lieve our findings, or at least our methodology, provide 
a context for reasoning about the net value provided by 
CAPTCHAs under existing threats and offer some direc-
tions for future development.

The remainder of this paper is organized as fol- 
lows: Section 2 reviews CAPTCHA design and provides 
a qualitative history and overview of the CAPTCHA- 
solving ecosystem. Next, in Section 3 we empirically 
characterize two automated solver systems, the popular 
Xrumer package and a specialized reCaptcha solver. In 
Sections 4 and 5 we then characterize today’s human- 
powered CAPTCHA-solving services, first describing our 


¹By agreement, we do not identify MR. E or the particular service
he runs. While we cannot validate all of his statements, when we tested
his service empirically our results for measures such as response time,
accuracy, capacity and labor makeup were consistent with his reports,
supporting his veracity.

data collection approach and then presenting our experi-
ments to measure key qualities such as response time, ac- 
curacy, and capacity. Section 6 describes the demograph- 
ics of the CAPTCHA-solving labor pool. Finally, we dis- 
cuss the implications of our results in Section 7 along 
with potential directions for future research. 


2 Background 


The term “CAPTCHA” was first introduced in 2000 by 
von Ahn et al. [21], describing a test that can differentiate 
humans from computers. Under common definitions [4], 
the test must be: 


• Easily solved by humans,
• Easily generated and evaluated, but
• Not easily solved by computer.


Over the past decade, a number of different techniques 
for generating CAPTCHAs have been developed, each 
satisfying the properties described above to varying de- 
grees. The most commonly found CAPTCHAs are visual 
challenges that require the user to identify alphanumeric 
characters present in an image obfuscated by some com- 
bination of noise and distortion.² Figure 1 shows ex-
amples of such visual CAPTCHAs. The basic challenge 
in designing these obfuscations is to make them easy 
enough that users are not dissuaded from attempting a so- 
lution, yet still too difficult to solve using available com- 
puter vision algorithms. 

The issue of usability has been studied on a functional 
level—focusing on differences in expected accuracy and 
response time [3, 19, 22, 26]—but the ultimate effect of 
CAPTCHA difficulty on legitimate goal-oriented users is 
not well documented in the literature. That said, Elson et 
al. provide anecdotal evidence that “even relatively sim- 
ple challenges can drive away a substantial number of po- 


²There exists a range of non-textual and even non-visual CAPTCHAs
that have been created but, excepting Microsoft’s Asirra [9], we do not
consider them here as they play a small role in the current CAPTCHA-
solving ecosystem.

tential customers” [9], suggesting CAPTCHA design re-
flects a real trade-off between protection and usability. 

The second challenge, defeating automation, has re- 
ceived far more attention and has kicked off a competi- 
tion of sorts between those building ever more sophisti- 
cated algorithms for breaking CAPTCHAs and those cre- 
ating new, more obfuscated CAPTCHAs in response [7,
11, 16, 17, 18, 25]. In the next section we examine this
issue in more depth and explain why, for economic rea-
sons, automated solving has been relegated to a niche
status in the open market. 

Finally, an alternative regime for solving CAPTCHAs
is to outsource the problem to human workers. Indeed, 
this labor-based approach has been commoditized and 
today a broad range of providers operate to buy and sell 
CAPTCHA-solving service in bulk. We are by no means
the first to identify the growth of this activity. In particu-
lar, Danchev provides an excellent overview of several
CAPTCHA-solving services in his 2008 blog post “In-
side India’s CAPTCHA solving economy” [5]. We are,
however, unaware of significant quantitative analysis of 
the solving ecosystem and its underlying economics. The 
closest work to our own is the complementary study of 
Bursztein et al. [3] which also uses active CAPTCHA- 
solving experiments, but is focused primarily on the issue 
of CAPTCHA difficulty rather than the underlying busi- 
ness models. 


3 Automated Software Solvers 


From the standpoint of an adversary, automated solv- 
ing offers a number of clear advantages, including both 
near-zero marginal cost and near-infinite capacity. At 
a high level, automated CAPTCHA solving combines 
segmentation algorithms, designed to extract individ- 
ual symbols from a distorted image, with basic op- 
tical character recognition (OCR) to identify the text 
present in CAPTCHAs. However, building such algo-
rithms is complex (by definition, since CAPTCHAs are
designed to evade existing vision techniques), and auto- 
mated CAPTCHA solving often fails to replicate human 
accuracy. These constraints have in turn influenced the 
evolution of automated CAPTCHA solving as it transi- 
tioned from a mere academic contest to an issue of com- 
mercial viability. 


3.1 Empirical Case Studies 


We explore these issues empirically through two rep- 
resentative examples: Xrumer, a mature forum spam- 
ming tool with integrated support for solving a range 
of CAPTCHAs, and reCaptchaOCR, a modern specialized
solver that targets the popular reCaptcha service.

Xrumer


Xrumer [24] is a well-known forum spamming tool, 
widely described on “blackhat” SEO forums as being one
of the most advanced tools for bypassing many differ- 
ent anti-spam mechanisms, including CAPTCHAs. It has 
been commercially available since 2006 and currently re- 
tails for $540, and we purchased a copy from the au- 
thor at this price for experimentation. While we would 
have liked to include several other well known spamming 
tools (SEnuke, AutoPligg, ScrapeBox, etc.), the cost of
these packages ranges from $97 to $297, which would
render this study prohibitively expensive. 

Xrumer’s market success in turn led to a surge of 
spam postings causing most service providers targeted 
by Xrumer to update their CAPTCHAs. This development 
kicked off an “arms race” period in Xrumer’s evolution 
as the author updated solvers to overcome these obsta- 
cles. Version 5.0 of Xrumer was released in October of 
2008 with significantly improved support for CAPTCHA 
solving. We empirically verified that 5.0 was capable 
of solving the default CAPTCHAs for then-current ver-
sions of a number of major message boards, including
Invision Power Board (IPB) version 2.3.0, phpBB ver-
sion 3.0.2, Simple Machines Forum (SMF) version 1.1.6,
and vBulletin version 3.6. These systems responded in 
kind, and when we installed versions of these packages 
released shortly after Xrumer 5.0 (in particular, phpBB 
and vBulletin) we verified that their CAPTCHAs had been 
modified to defeat Xrumer’s contemporaneous solver. 
Today, we have found that the only major message fo- 
rum software whose default CAPTCHA Xrumer can solve 
is Simple Machines Forum (SMF). 

With version 5.0.9 (released August 2009), Xrumer 
added integration for human-based CAPTCHA-solving 
services: Anti-Captcha (an alias for Antigate) and 
CaptchaBot. We take this as an indication that the author 
of Xrumer found the ongoing investment in CAPTCHA- 
solving software to be insufficient to support customer 
requirements.³ That said, Xrumer can be configured
to use a hybrid software/human based approach where 
Xrumer detects instances of CAPTCHAs vulnerable to its 
automated solvers and uses human-based solvers oth- 
erwise. In the current version of Xrumer (5.0.12), the 
CAPTCHA-related development seems to focus on sup- 
porting automatic navigation and CAPTCHA “extraction” 
(detecting the CAPTCHA and identifying the image file 
to send to the human-based CAPTCHA-solving service) 
of more Web sites, as well as evading other anti-spam 
techniques.


³The developers of Xrumer have recently been advertising en-
hanced CAPTCHA-solving functionality in their forthcoming “7.0 Elite”
version (including support for reCaptcha), but the release date has been
steadily postponed and, as of this writing (June 2010), version 5.0.12 is
the latest.

When compared with developers targeting “high-
value” CAPTCHAs (e.g., reCaptcha, Microsoft, Yahoo,
Google, etc.), Xrumer has mostly targeted “weaker”
CAPTCHAs and seems to have a policy of only includ-
ing highly efficient and accurate software-based solvers. 
In our tests, all but one included solver required a second 
or less per CAPTCHA (on a netbook class computer with 
only a 1.6-GHz Intel Atom CPU) and had an accuracy of 
100%. The one more difficult case was the solver for the 
phpBB version 3 forum software with the GD CAPTCHA 
generator and foreground noise. In this case, Xrumer had 
an accuracy of only 35% and required 6-7 seconds per
CAPTCHA to execute. 


reCaptchaOCR 


At the other end of the spectrum, we obtained a spe- 
cialized solver focused singularly on the popular re- 
Captcha service. Wilkins developed the solver as a proof 
of concept [23]. The existence of this OCR-based re- 
Captcha solver was reported in a blog posting on De- 
cember 15, 2009 [6]. Although developed to defeat an 
earlier version of reCaptcha CAPTCHAs (Figure 2a), re- 
CaptchaOCR was also able to defeat the CAPTCHA vari- 
ant in use at the time of release (Figure 2b). Subse- 
quently, reCaptcha changed their CAPTCHA-generation 
code again to the version as of this writing (Figure 2c). 
The tool has not been updated to solve this new variant. 

We tested reCaptchaOCR on 100 randomly selected 
CAPTCHAs of the early 2008 variant and 100 randomly 
selected CAPTCHAs of the late 2009 variant. We scored 
the answers returned using the same algorithm that re- 
Captcha uses by default. reCaptcha images consist of 
two words: a control word for which the correct solu-
tion is known, and a second word for which the solu-
tion is unknown (the service is used to opportunistically
implement human-based OCR functionality for difficult 
words). By default reCaptcha will mark a solution as cor- 
rect if it is within an edit distance of one of the control 
word. However, while we know the ground truth for both 
words in our tests, we do not know which was the control 
word. Thus, we credited the solver with half a correct so- 
lution for each word it solved correctly in the CAPTCHA, 
reasoning that there was a 50% chance of each word be- 
ing the control word. 
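
The scoring rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scoring code used in the experiments: the function names are hypothetical, the comparison is case-sensitive, and the edit-distance tolerance of one is applied to each word because the control word is not known.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        prev = cur
    return prev[-1]


def expected_credit(answers, truths, tolerance=1):
    """Credit for one two-word challenge.

    Each word is equally likely to be the control word, so each word
    answered within the tolerance contributes half a correct solution.
    """
    return sum(
        0.5
        for answer, truth in zip(answers, truths)
        if edit_distance(answer, truth) <= tolerance
    )


def accuracy(results):
    """Mean credit over a list of (answers, truths) pairs."""
    return sum(expected_credit(a, t) for a, t in results) / len(results)
```

For instance, an answer of ("hello", "wrld") against the ground truth ("hello", "world") earns full credit, since the second word is one deletion away from the truth.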

We observed an accuracy of 30% for the 2008-era test 
set and 18% for the 2009-era test set using the default 
setting of 613 iterations,* far lower than the average hu- 
man accuracy for the same challenges (75-90% in our 
experiments). 

Finally, we measured the overhead of reCaptchaOCR. 
On a laptop using a 2.13-GHz Intel Core 2 Duo each so- 


⁴The solver performs multiple iterations and uses the majority so-
lution to improve its accuracy.

lution required an average of 105 seconds. By reducing
the number of iterations to 75 we could reduce the solv- 
ing time to 12 seconds per CAPTCHA, which is in line 
with the response time for a human solver. At this num- 
ber of iterations, reCaptchaOCR still achieved similar ac- 
curacies: 29% for the 2008-era CAPTCHAs and 17% for 
the 2009-era CAPTCHAs.


3.2 Economics 


Both of these examples illustrate the inherent challenges 
in fielding commercial CAPTCHA-solving software. 

While the CAPTCHA problem is often portrayed in 
academia as a technical competition between CAPTCHA 
designers and computer vision experts, this perspective 
does not capture the business realities of the CAPTCHA- 
solving ecosystem. Arms races in computer security 
(e.g., anti-virus, anti-spam, etc.) traditionally favor the 
adversary, largely because the attacker’s role is to gen- 
erate new instances while the defender must recognize 
them—and the recognition problem is almost always 
much harder. However, CAPTCHAs reverse these roles
since Web sites can be agile in their use of new CAPTCHA 
types, while attackers own the more challenging recog- 
nition problem. Thus, the economics of automated solv- 
ing are driven by several factors: the cost to develop new 
solvers, the accuracy of these solvers and the responsive-
ness of the sites whose CAPTCHAs are attacked.

While it is difficult to precisely quantify the develop- 
ment cost for new solvers, it is clear that highly skilled 
labor is required and such developers must charge com- 
mensurate fees to recoup their time investment. Anecdo- 
tally, we contacted one such developer who was offering 
an automated solving library for the current reCaptcha 
CAPTCHA. He was charging $6,500 on a non-exclusive 
basis, and we did not pay to test this solver. 

At the same time, as we saw with reCaptchaOCR, it 
can be particularly difficult to produce automated solvers 
that can deliver human-comparable accuracy (especially 
for “high-value” CAPTCHAs). While it seems that accu- 
racy should be a minor factor since the cost of attempt- 
ing a CAPTCHA is all but “free”, in reality low success
rates limit both the utility of a solver and its useful life- 
time. In particular, over short time scales, many forums 
will blacklist an IP address after 5-7 failed attempts.
More importantly, should a solver be put into wide use, 
changes in the gross CAPTCHA success rate over longer
periods (e.g., days) are a strong indicator that a software
solver is in use—a signature savvy sites use to revise 
their CAPTCHAs in turn.⁵

Thus, for a software solver to be profitable, its price 
must be less than the total value that can be extracted


⁵We are aware that some well-managed sites already have alterna-
tive CAPTCHAs ready for swift deployment in just such a situation.

Figure 2: Examples of CAPTCHAs downloaded directly from reCaptcha at different time periods: (a) early 2008, (b) December 16th 2009, (c) January 24th 2010.


in the useful lifetime before the solver is detected and 
the CAPTCHA changed. Moreover, for this approach to 
be attractive, it must also cost less than the alterna- 
tive: using a human CAPTCHA-solving service. To make 
this tradeoff concrete, consider the scenario in which a 
CAPTCHA-solving service provider must choose between 
commissioning a new software solver (e.g., for a variant 
of a popular CAPTCHA) or simply outsourcing recogni- 
tion piecemeal to human laborers. If we suppose that it 
costs $10,000 to implement a solver for a new CAPTCHA 
type with a 30% accuracy (like reCaptchaOCR), then it 
would need to be used over 65 million times (20 mil- 
lion successful) before it was a better strategy than sim- 
ply hiring labor at $0.5/1,000.⁶ However, the evidence
from reCaptcha’s response to reCaptchaOCR suggests 
that CAPTCHA providers are well able to respond before 
such amortization is successful. Indeed, in our interview, 
MR. E said that he had dabbled with automated solving 
but that new solvers stopped working too quickly. In his 
own words, “It is a big waste of time.” 
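
The break-even arithmetic in the scenario above can be checked with a short Python sketch. It is only an illustration: the function and parameter names are hypothetical, and it assumes that each software attempt has zero marginal cost, that failed attempts carry no penalty, and that the human price is paid per correct solution.

```python
def break_even(dev_cost, accuracy, human_price_per_1000):
    """Return (successes, attempts) at which a software solver's
    development cost equals the cost of buying the same number of
    correct solutions from human labor."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    # Correct solutions that dev_cost would buy from human solvers.
    successes = dev_cost * 1000 / human_price_per_1000
    # Attempts the solver needs in order to produce that many successes.
    attempts = successes / accuracy
    return successes, attempts


# Figures from the scenario: $10,000 solver, 30% accuracy, $0.5 per 1,000.
successes, attempts = break_even(10_000, 0.30, 0.5)
```

With these inputs the solver must deliver 20 million correct solutions, which at 30% accuracy takes roughly 67 million attempts, consistent with the "over 65 million" figure in the text.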

For these reasons, software solvers appear to have 
been relegated to a niche status in the solving 
ecosystem—focusing on those CAPTCHAs that are static 
or change slowly in response to pressure. While a tech- 
nological breakthrough could reverse this state of affairs, 
for now it appears that human-based solving has come to 
dominate the commercial market for service. 


4 Human Solver Services 


Since CAPTCHAs are only intended to obstruct au-
tomated solvers, their design point can be entirely 
sidestepped by outsourcing the task to human labor 
pools, either opportunistically or on a “for hire” basis. In 
this section, we review the evolution of this labor market, 
its basic economics and some of the underlying ethical 
issues that informed our subsequent measurement study. 


4.1 Opportunistic Solving 


Opportunistic human solving relies on convincing an in- 
dividual to solve a CAPTCHA as part of some other un- 
related task. For example, an adversary controlling ac- 
cess to a popular Web site might use its visitors to op- 


⁶Moreover, human labor is highly flexible and can be used for the
wide variety of CAPTCHAs demanded by customers, while a software
solver inevitably is specialized to one particular CAPTCHA type.

portunistically solve third-party CAPTCHAs by offer-
ing these challenges as its own [1, 8]. A modern vari- 
ant of this approach has recently been employed by the 
Koobface botnet, which asks infected users to solve a 
CAPTCHA (under the guise of a Microsoft system man- 
agement task) [13]. However, we believe that retention 
of these unwitting solvers will be difficult due to the high 
profile nature and annoyance of such a strategy, and we 
do not believe that opportunistic solving plays a major 
role in the market today. 


4.2 Paid Solving 


Our focus is instead on paid labor, which we believe now 
represents the core of the CAPTCHA-solving ecosystem, 
and the business model that has emerged around it. Fig- 
ure 3 illustrates a typical workflow and the business rela- 
tionships involved. 

The premise underlying this approach is that there ex- 
ists a pool of workers who are willing to interactively 
solve CAPTCHAs in exchange for less money than the
solutions are worth to the client paying for their services. 

The earliest description we have found for such a re- 
lationship is in a Symantec Blog post from September 
2006 that documents an advertisement for a full-time 
CAPTCHA solver [20]. The author estimates that the re-
sulting bids were equivalent to roughly one cent per 
CAPTCHA solved, or $10/1,000 (solving prices are com- 
monly expressed in units of 1,000 CAPTCHAs solved). 
Starting from this date, one can find increasing num- 
bers of such advertisements on “work-for-hire” sites such 
as getafreelancer.com, freelancejobsearch.com, and mis- 
tersoft.com. Shortly thereafter, retail CAPTCHA-solving 
services began to surface to resell such capabilities to a 
broad range of customers. 

Moreover, a fairly standard business model has 
emerged in which such retailers aggregate the demand 
for CAPTCHA-solving services via a public Web site 
and open API. The example in Figure 3 shows the 
DeCaptcher service performing this role in steps ②
and ⑥. In addition, these retailers aggregate the sup-
ply of CAPTCHA-solving labor by actively recruiting 
individuals to participate in both public and private 
Web-based “job sites” that provide online payments for 
CAPTCHAs solved. PixProfit, a worker aggregator for the
DeCaptcher service, performs this role in steps ③-⑤ in
the example.

Figure 3: CAPTCHA-solving market workflow: ① GYC Automator attempts to register a Gmail account and is challenged with a
Google CAPTCHA. ② GYC uses the DeCaptcher plug-in to solve the CAPTCHA at $2/1,000. ③ DeCaptcher queues the CAPTCHA
for a worker on the affiliated PixProfit back end. ④ PixProfit selects a worker and pays at $1/1,000. ⑤ Worker enters a solution to
PixProfit, which ⑥ returns it to the plug-in. ⑦ GYC then enters the solution for the CAPTCHA to Gmail to register the account.


4.3 Economics 


While the market for CAPTCHA-solving services has 
expanded, the wages of workers solving CAPTCHAs
have been declining. A cursory examination of histori- 
cal advertisements on getafreelancer.com shows that, in 
2007, CAPTCHA solving routinely commanded wages as 
high as $10/1,000, but by mid-2008 a typical offer had 
sunk to $1.5/1,000, $1/1,000 by mid-2009, and today 
$0.75/1,000 is common, with some workers earning as 
little as $0.5/1,000. 

This downward price pressure reflects the commodity 
nature of CAPTCHA solving. Since solving is an unskilled 
activity, it can easily be sourced, via the Internet, from 
the most advantageous labor market—namely the one 
with the lowest labor cost. We see anecdotal evidence of 
precisely this pattern as advertisers switched from pur- 
suing laborers in Eastern Europe to those in Bangladesh, 
China, India and Vietnam (observations further corrobo- 
rated by our own experimental results later). 

Moreover, competition on the retail side exerts 
pressure for all such employers to reduce their wages 
in turn. For example, here is an excerpt from a recent 
announcement at typethat.biz, the “worker side” of one 
such CAPTCHA-solving service:


2009-12-14 13:54 Admin post

Hello, as you could see, server was unstable 
last days. We can’t get more captchas 
because of too high prices in comparison 
with other services. To solve this problem, 
unfortunately we have to change the rate, 

on Tuesday it will be reduced.

Shortly thereafter, typethat.biz reduced their offered
rate from $1/1,000 to $0.75/1,000 to stay competitive. 

These changes reflect similar decreases on the re- 
tail side: the customer cost to have 1,000 CAPTCHAs 
solved is now commonly $2/1,000 and can be as low as 
$1/1,000. To protect prices, a number of retailers have 
tried to tie their services to third-party products with 
varying degrees of success. For example, GYC Automa- 
tor is a popular “black hat” bulk account creator for 
Gmail, Yahoo and Craigslist; Figure 3 shows GYC’s 
role in the CAPTCHA ecosystem, with the tool scraping 
a CAPTCHA in step 1 and supplying a CAPTCHA 
solution in step 7. GYC has a relationship with the 
CAPTCHA-solving service Image2Type (not to be confused 
with ImageToType). Similarly, SENuke is a blog 
and forum spamming product that has integral sup- 
port for two “up-market” providers, BypassCaptcha and 
BeatCaptchas. In both cases, this relationship allows 
the CAPTCHA-solving services to charge higher rates: 
roughly $7/1,000 for BypassCaptcha and BeatCaptchas, 
and over $20/1,000 for Image2Type. It also provides an 
ongoing revenue source for the software developer. For 
his service, MR. E confirms that software partners bring 
in many customers (indeed, they are the majority revenue 
source) and that he offers a variety of revenue sharing op- 
tions to attract such partners. 

However, such large price differences encourage arbi- 
trage, and in some cases third-party developers have cre- 
ated plug-ins to allow the use of cheaper services on such 
packages. Indeed, in the case of GYC Automator, an in- 
dependent developer built a DeCaptcher plug-in which 
reduced the solving cost by over an order of magnitude. 
This development has created an ongoing conflict be- 
tween the seller of GYC Automator and the distributor of 
the DeCaptcher plug-in. Other software developers have 
chosen to forgo large margin revenue sharing in favor of 
service diversity. For example, modern versions of the 
Xrumer package can use multiple price-leading services 
(Antigate and CaptchaBot). 

Finally, while it is challenging to measure profitability 
directly, we have one anecdotal data point. In our discus- 
sions with MR. E, whose service is in the middle of the 
price spectrum, he indicated that routinely 50% of his 
revenue is profit, roughly 10% is for servers and band- 
width, and the remainder is split between solving labor 
and incentives for partners. 


4.4 Active Measurement Issues 


The remainder of our paper focuses on active measure- 
ment of such services, both by paying for solutions and 
by participating in the role of a CAPTCHA-solving la- 
borer. The security community has become increasingly 
aware of the need to consider the legal and ethical context 
of its actions, particularly for such active involvement, 
and we briefly consider each in turn for this project. 

In the United States (we restrict our brief discussion to 
U.S. law since that is where we operate), there are sev- 
eral bodies of law that may impinge on CAPTCHA solv- 
ing. First, even though the services being protected are 
themselves “free”, it can be argued that CAPTCHAs are 
an access control mechanism and thus evading them ex- 
ceeds the authorization granted by the site owner, in po- 
tential violation of the Computer Fraud and Abuse Act 
(and certainly of their terms of service). While this in- 
terpretation is debatable, it is a moot point for our study 
since we never make use of solved CAPTCHAs and thus 
never access any of the sites in question. A trickier issue 
is raised by the Digital Millennium Copyright Act’s anti- 
circumvention clause. While there are arguments that 
CAPTCHA solvers provide a real use outside circumvention 
of copyright controls (e.g., as aids for the visually 
impaired) it is not clear—especially in light of increas- 
ingly common audio CAPTCHA options—that such a de- 
fense is sufficient to protect infringers. Indeed, Ticket- 
master recently won a default judgment against RMG 
Technologies (who sold automated software to bypass 
the Ticketmaster CAPTCHA) using just such an argu- 
ment [2]. That said, while one could certainly apply the 
DMCA against those offering a service for CAPTCHA- 
solving purposes, it seems a stretch to include individual 
human workers as violators since any such “circumven- 
tion” would include innate human visual processes. 

Aside from potential legal restrictions, there are also 
related ethical concerns; one can do harm without such 
actions being illegal. In considering these questions, we 
use a consequentialist approach — comparing the con- 
sequences of our intervention to an alternate world in 
which we took no action — and evaluate the outcome 
for its cost-benefit tradeoff. 

On the purchasing side, we have no direct impact 
since we do not actually use the solutions on their respec- 
tive sites. We do have an indirect impact however since, 
through purchasing services, we are providing support 
to both workers and service providers. In weighing this 
risk, we concluded that the indirect harm of our relatively 
small investment was outweighed by the benefits that 
come from better understanding the nature of the threat. 
On the solving side, the ethical questions are murkier 
since we understand that solutions to such CAPTCHAs 
will be used to circumvent the sites they are associated 
with. To sidestep this concern, we chose not to solve 
these CAPTCHAs ourselves. Instead, for each CAPTCHA 
one of our worker agents was asked to solve, we proxied 
the image back into the same service via the associated 
retail interface. Since each CAPTCHA is then solved by 
the same set of solvers who would have solved it any- 
way, we argue that our activities do not impact the gross 
outcome. This approach does cause slightly more money 
to be injected into the system, but this amount is small. 

Finally, we consulted with our human subjects liaison 
on this work and we were told that the study did not re- 
quire approval. 


5 Solver Service Quality 


In this section we present our analysis of CAPTCHA- 
solving services based on actively engaging with a range 
of services as a client. We evaluate the customer inter- 
face, solution accuracy, response time, availability, and 
capacity of the eight retail CAPTCHA-solving services 
listed in Table 1. 

We chose these services through a combination of Web 
searching and reading Web forums focused on “black- 
hat” search-engine optimization (SEO). In October of 
2009, we selected the eight listed in Table 1 because 
they were well-advertised and reflected a spectrum of 
price offerings at the time. Over the course of our study, 
two of the services (CaptchaGateway and CaptchaBy- 
pass) ceased operation—we suspect because of compe- 
tition from lower-priced vendors. 


5.1 Customer Account Creation 


For most of these services, account registration is accomplished 
via a combination of the Web and e-mail: contact 
information is provided via a Web site and subsequent 
sign-up interactions are conducted largely via e-mail. 
However, most services presented some obstacles 
to account creation, reflecting varying degrees of due 
diligence. 


Service              $/1K Bulk  Dates (2009-2010)            Requests  Responses
Antigate (AG)        $1.00      Oct 06 - Feb 01 (118 days)   28,210    27,726 (98.28%)
BeatCaptchas (BC)    $6.00      Sep 21 - Feb 01 (133 days)   28,303    25,708 (90.83%)
BypassCaptcha (BY)   $6.50      Sep 23 - Feb 01 (131 days)   28,117    27,729 (98.62%)
CaptchaBot (CB)      $1.00      Oct 06 - Feb 01 (118 days)   28,187    22,677 (80.45%)
CaptchaBypass (CP)   $5.00      Sep 23 - Dec 23 (91 days)    17,739    15,869 (89.46%)
CaptchaGateway (CG)  $6.60      Oct 21 - Nov 03 (13 days)     1,803     1,715 (95.12%)
DeCaptcher (DC)      $2.00      Sep 21 - Feb 01 (133 days)   28,284    24,411 (86.31%)
ImageToText (IT)     $20.00     Oct 06 - Feb 01 (118 days)   14,321    13,246 (92.49%)

Table 1: Summary of the customer workload to the CAPTCHA-solving services. 

For example, both CaptchaBot and Antigate required 
third-party “invitation codes” to join their services, 
which we acquired from the previously mentioned fo- 
rums. Interestingly, Antigate guards against Western 
users by requiring site visitors to enter the name of 
the Russian prime minister in Cyrillic before grant- 
ing access—an innovation we refer to as a “culturally- 
restricted CAPTCHA”.’ Some services require a live 
phone call for account creation, for which we used an 
anonymous mobile phone to avoid any potential biases 
arising from using a University phone number. In our ex- 
perience, however, the burden of proof demanded is quite 
low and our precautions were likely unnecessary. For ex- 
ample, setting up an ImageToText account required a val- 
idation call, but the only question asked was “Did you 
open an account on ImageToText?” Upon answering in 
the affirmative (in a voice clearly conflicting with the 
gender of the account holder’s name), our account was 
promptly enabled. For one service, DeCaptcher, we cre- 
ated multiple accounts to evaluate whether per-customer 
rate limiting is in use (we found it was not). 

Finally, each service typically requires prepayment by 
customers, in units defined by their price schedule (1,000 
CAPTCHAs is the smallest “package” generally offered). 
To fund each account, we used prepaid VISA gift cards 
issued by a national bank unaffiliated with our university. 


5.2 Customer Interface 


Most services provide an API package for uploading 
CAPTCHAs and receiving results, often in multiple programming 
languages; we generally used the PHP-based 
APIs. BeatCaptchas and BypassCaptcha did not offer 
pre-built API packages, so we implemented our own API 
in Ruby to interface with their Web sites. The client APIs 
generally employ one of two methods when interacting 
with their corresponding services. In the first, the API 
client performs a single HTTP POST that uploads the image 
to the service, waits for the CAPTCHA to be solved, 
and receives the answer in the HTTP response; BeatCaptchas, 
BypassCaptcha, CaptchaBypass and CaptchaBot 
utilize this method. 


1 In principle, such an approach could be used to artificially restrict 
labor markets to specific cultures (i.e., CAPTCHA labor protectionism). 
However, it is an open problem whether a general form of 
culturally-restricted CAPTCHA can be devised that has both a large number of 
examples and a low false reject rate from its target population. 

In the second, the client performs one HTTP POST to 
upload the image, receives an image ID in the response, 
and subsequently polls the site for the CAPTCHA solu- 
tion using the image ID; Antigate, CaptchaGateway, and 
ImageToText employ this approach. These APIs recommend 
polling every 1-5 seconds; we polled these 
services once per second. DeCaptcher uses a custom pro- 
tocol that is not based on HTTP, although they also offer 
an HTTP interface. One interesting note about ImageTo- 
Text is that customers must verify that their API code 
works in a test environment before gaining access to the 
actual service. The test environment allows users to see 
the CAPTCHAs they submit and solve them manually. 
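
As an illustration of the second (polling) style, the sketch below shows one way such a client loop might look. It is not the code used in the study: `upload` and `poll` are hypothetical stand-ins for a service's HTTP calls, and the 60-second timeout is an assumed value. Only the one-second poll interval comes from the text above.

```python
import time
from typing import Callable, Optional


def solve_by_polling(
    upload: Callable[[bytes], str],
    poll: Callable[[str], Optional[str]],
    image: bytes,
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Upload an image, then poll for its solution until it arrives or we time out.

    `upload` and `poll` are hypothetical wrappers around a service's HTTP
    requests: `upload` returns an image ID, and `poll` returns the solution
    text, or None while the CAPTCHA is still unsolved.
    """
    image_id = upload(image)
    waited = 0.0
    while waited <= timeout:
        answer = poll(image_id)
        if answer is not None:
            return answer
        sleep(interval)  # pause between polls (one second by default)
        waited += interval
    return None  # no solution arrived within the timeout
```

The first (blocking) style needs no loop at all: the single upload request simply returns the answer. Passing `sleep` as a parameter keeps the sketch testable without real delays.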


5.3 Service Pricing 


Several of the services, notably Antigate and De- 
Captcher, offer bidding systems whereby a customer can 
offer payment over the market rate in exchange for higher 
priority access to solvers when load is high. In our ex- 
perience, DeCaptcher charges customers their full bid 
price, while Antigate typically charges at a lower rate de- 
pending on load (as might happen in a second-price auc- 
tion). To effectively use Antigate, we set our bid price to 
$2/1,000 solutions since we experienced a large volume 
of load shedding error codes at the minimum bid price 
of $1/1,000 (Section 5.9 reports on our experiences with 
service load in more detail). We have not seen price fluc- 
tuations on the worker side of these services, and thus 
we believe that this overage represents pure profit to the 
service provider. 


5.4 Test Corpus 


We evaluated the eight CAPTCHA-solving services in Ta- 
ble 1 as a customer over the course of about five months 
using a representative sample of CAPTCHAs employed 
by popular Web sites. To collect this CAPTCHA workload, 
we assembled a list of 25 popular Web sites with 
unique CAPTCHAs based on the Alexa rank of the site 
and our informal assessment of its value as a target (see 
Figure 5 for the complete list). We also used CAPTCHAs 
from reCaptcha, a popular CAPTCHA provider used by 
many sites. We then collected about 7,500 instances of 
each CAPTCHA directly from each site. For the capacity 
measurement experiments (Section 5.8), we used 12,000 
instances of the Yahoo CAPTCHA graciously provided to 
us by Yahoo. 


5.5 Verifying Solutions 


To assess the accuracy of each service, we needed to de- 
termine the correct solution for each CAPTCHA in our 
corpus. We used the services themselves to do this for 
us. For each instance, we used the most frequent solution 
returned by the solver services, after normalizing cap- 
italization and whitespace. If there was more than one 
most frequent solution, we treated all answers as incor- 
rect (taking this to mean that the CAPTCHA had no cor- 
rect solution). Table 1 shows the overall accuracy of each 
service as given by our method. 
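
The plurality heuristic described above can be sketched as follows. This is a minimal illustration rather than the study's code. The exact normalization is an assumption (lower-casing and removing all whitespace), and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def normalize(solution: str) -> str:
    """Lower-case a solution and remove all whitespace (assumed normalization)."""
    return "".join(solution.lower().split())


def plurality_solution(solutions: Iterable[str]) -> Optional[str]:
    """Return the unique most frequent normalized solution.

    Returns None when there is no solution at all or when two or more
    solutions tie for most frequent, in which case every answer for that
    CAPTCHA is treated as incorrect.
    """
    counts = Counter(normalize(s) for s in solutions)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place: no reference solution
    return ranked[0][0]
```

A service's answer would then be scored as correct only if its normalized form equals the value returned by `plurality_solution` for that CAPTCHA.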

To validate this heuristic, we randomly selected 1,025 
CAPTCHAs having at least one service-provided solution 
and manually examined the images. Of these, we were 
able to solve 1,009, of which 940 had a unique plurality 
that agreed with our solution, giving an error rate 
for the heuristic of just over 8%. Of the 16 CAPTCHAs 
(1.6%) we could not solve, seven were entirely unread- 
able, six had ambiguous characters (e.g., ‘0’ vs. ‘o’, ‘6’ 
vs. ‘b’), and three were rendered ambiguous due to over- 
lapping characters. (We note that Bursztein et al. [3] re- 
moved CAPTCHAs with no majority from their calcula- 
tion, which resulted in a higher estimated accuracy than 
we found in our study.) 
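
The validation counts can be checked with a few lines of arithmetic. The denominator behind the quoted error rate is not spelled out above, so the sketch below computes two candidate readings. Treat it as a consistency check under that assumption, not as the definitive calculation.

```python
sampled = 1025  # CAPTCHAs manually examined
solved = 1009   # CAPTCHAs that could be solved by hand
agreed = 940    # unique plurality matched the manual solution

unsolvable = sampled - solved
unsolvable_pct = 100 * unsolvable / sampled

# Reading 1: every sampled CAPTCHA without an agreeing plurality counts as an error.
disagree_of_sampled = 100 * (sampled - agreed) / sampled
# Reading 2: only hand-solvable CAPTCHAs are considered.
disagree_of_solved = 100 * (solved - agreed) / solved

print(f"unsolvable: {unsolvable} ({unsolvable_pct:.1f}%)")
print(f"error rate over all sampled: {disagree_of_sampled:.1f}%")
print(f"error rate over solvable only: {disagree_of_solved:.1f}%")
```

Only the first reading, which counts the 16 unsolvable CAPTCHAs as heuristic failures, yields a value just over 8%; the second gives about 6.8%.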


5.6 Quality of Service 


To assess the accuracy, response time, and service availability 
of the eight CAPTCHA solving services, we continuously 
submitted CAPTCHAs from our corpus to each 
service over the course of the study. We submitted a 
single CAPTCHA every five minutes to all services simultaneously, 
recording the time when we submitted the 
CAPTCHA and the time when we received the response. 
Recall that ImageToText, Antigate and CaptchaGateway 
require customers to poll the service for the response to 
a submitted CAPTCHA; we paused one second between 
each poll call. 

Figure 4: Median error rate and response time (in seconds) for 
all services. Services are ranked top-to-bottom in order of increasing 
error rate. 

Figure 6: Median error rate and response time (in seconds) for 
all CAPTCHAs. CAPTCHAs are ranked top-to-bottom in order of 
increasing error rate. 

Table 1 also summarizes the dates, durations, and 
number of CAPTCHA requests we submitted to the ser- 
vices; Figure 5 presents the error rate and mean response 
time at a glance for each combination of solver service 
and CAPTCHA type. We used each service for up to 118 
days, submitting up to 28,303 requests per service during 
that period. We were not able to submit the same num- 
ber of CAPTCHAs to all services for a number of rea- 
sons. For example, services would go offline temporar- 
ily, or we would rewrite parts of our client implementa- 
tion, thus requiring us to temporarily remove the service 
from the experiment. Furthermore, CaptchaGateway and 
CaptchaBypass ceased operation during our study. 

Figure 5: Error rate and median response time for each combination of service and CAPTCHA type. The area of each circle in the upper 
table is proportional to the error rate (among solved CAPTCHAs). In the lower table, circle area is proportional to the response time 
minus ten seconds (for increased contrast); negative values are denoted by unshaded circles. Numeric values corresponding to the 
values in the leftmost and rightmost columns are shown on the side. Thus, the error rate of BypassCaptcha on Youku CAPTCHAs is 
66%, and for BeatCaptchas on PayPal 4%. The median response time of CaptchaGateway on Youku is 21 seconds, and 8 seconds 
for Antigate on PayPal. 


Accuracy 


A CAPTCHA solution is only useful if it is correct. The 
left bar plot in Figure 4 shows the median error rate for 
each service. Overall the services are reasonably accurate: 
with the exception of BypassCaptcha, 86-89% of 
responses⁸ were correct. This level of accuracy is in line 
with results reported by Bursztein et al. [3] for human 
solvers and substantially better than the accuracy of 
reCaptchaOCR (Section 3). 

By design, CAPTCHAs vary in difficulty. Do the ob- 
served error rates reflect such differences? The top half 
of Figure 5 shows service accuracy (in terms of its er- 
ror rate) on each CAPTCHA type. The area of each circle 
is proportional to a service’s mean error rate on a par- 
ticular CAPTCHA type. Services are arranged along the 
y-axis in order of increasing accuracy, with the most ac- 
curate (lowest error rate) at the top and the least accurate 
(highest error rate) at the bottom. CAPTCHA types are ar- 
ranged in decreasing order of their median error rate. The 
median error rate of each type is also shown in Figure 6. 

Accuracy clearly depends on the type of CAPTCHA. 
The error rate for ImageToText with Youku, for instance, 
is 5 times its PayPal error rate. Furthermore, the ranking 
of CAPTCHA accuracies is generally consistent across 
the services: all services have relatively poor accuracy 
on Youku and good accuracy on PayPal. 


8 The error rate is over received responses and does not include rejected 
requests. We consider response rate to be a measure of availability 
rather than accuracy. 

Based on the data, one might conclude that a group 
of CAPTCHAs on the left headed by Youku, reCaptcha, 
Slashdot, and Taobao are “harder” than the rest. However, 
an important factor affecting solution accuracy (as 
well as response time) in our measurements is worker familiarity 
with a CAPTCHA type. In the case of Youku, for 
instance, workers may simply be unfamiliar with these 
CAPTCHAs. On the other hand, workers are likely familiar 
with reCaptcha CAPTCHAs (see Section 6.6), which 
may genuinely be “harder” than the rest. As a point of 
comparison, MR. E reported in our interview that his ser- 
vice experiences a 5—10% error rate. Since his CAPTCHA 
mix is likely different, and less diverse, than our full set, 
his claim seems reasonable. 


Response Time 


In addition to accuracy, customers want services that 
solve CAPTCHAs quickly. Figure 7 shows the cumulative 
distribution of response times of each service. The curves 
of CaptchaBot, CaptchaBypass, ImageToText, and Antigate 
exhibit the quantization effect of polling (either in 
the client API or on the server) as a stair-step pattern. 
The shape of the distributions is characteristically log-normal, 
with a median response of 14 seconds (across 
all services) and a third-quartile response time of 20 
seconds, well within the session timeout of most Web 
sites. For convenience, Figure 4 also shows median response 
times for each service. In contrast to Bursztein et 
al. [3], who used a different labor pool (Amazon Me- 
chanical Turk), we found no significant difference in re- 
sponse times of correct and incorrect responses. 

Services differ considerably in the relative response 
times they provide to their customers. Antigate (for 
which we paid a slight premium for priority service as 
described in Section 5.3) and ImageToText provided the 
fastest service with median response times of 9.6 seconds 
and 9.4 seconds, respectively, with 90% of CAPTCHAs 
solved under 25 seconds. CaptchaGateway was the slow- 
est service we measured, with a median of 21.3 seconds 
and 10% of responses taking over a minute; it was also 
one of the two services that ceased operation during our 
study. The remaining services fall in between those ex- 
tremes. MR. E reported that his service trains workers 
to achieve response times of 10—12 seconds on average, 
which is consistent with our measurements of his service. 

DeCaptcher and BeatCaptchas have very similar distributions. 
We have seen evidence (i.e., error messages 
from BeatCaptchas that are identical to ones documented 
for the DeCaptcher API) that suggests that BeatCaptchas 
uses DeCaptcher as a back end. Antigate returns some 
correct responses unusually quickly (a few seconds), for 
which we currently do not have an explanation; we have 
ruled out caching effects. 

Services have an advantage if they have better re- 
sponse times than their competition, and the services we 
measured differ substantially. We suspect that it is a com- 
bination of two factors: software and queueing delay in 
the service infrastructure, and worker efficiency. Anti- 
gate, for instance, appears to have an unusually large la- 
bor pool (Section 5.8), which may enable them to keep 
queueing delay low. Similarly, ImageToText appears to 
have an adaptive, high-quality labor pool (Section 6.4). 
We observed additional delays of 5 seconds due to load 
(Section 5.9), but load likely affects all services similarly. 

We found that accuracy varied with the type of 
CAPTCHA. A closely related issue is to what degree re- 
sponse time also varies according to CAPTCHA type. The 
bottom of Figure 5 shows response times by CAPTCHA 
type. Services are listed along the y-axis from slowest 
(top) to fastest service (bottom). The area of each circle 
is proportional to the median response time of a service 
on a particular CAPTCHA type minus ten seconds (for 
greater contrast). Shaded circles are times in excess of 
ten seconds, unshaded circles are times less than ten sec- 
onds. For example, the median response time of Antigate 
on PayPal CAPTCHAs—8 seconds—is shown as an un- 
shaded circle. Note that CAPTCHA types are still sorted 
by accuracy. The right half of Figure 4 aggregates re- 
sponse times by service, showing the median response 
time of each. 

Figure 8: Price for 1,000 correctly-solved CAPTCHAs within a 
given response time threshold. 


We see some variation in response time among 
CAPTCHA types. Youku and reCaptcha, for instance, 
consistently induce longer response times across ser- 
vices, whereas Baidu, eBay, and QQ consistently have 
shorter response times. However, the variation in re- 
sponse times among the services dominates the varia- 
tion due to CAPTCHA type. The fastest CAPTCHAs that 
DeCaptcher solves (e.g., Baidu and QQ) are slower on 
average than the slowest CAPTCHAs that Antigate and 
ImageToText solve. 


5.7 Value 


CAPTCHA solvers differ in terms of accuracy, response 
time, and price. The value of a particular solver to a 
customer depends upon the combination of all of these 
factors: a customer wants to pay the lowest price for 
both fast and accurate CAPTCHAs. For example, suppose 
that a customer wants to create 1,000 accounts on 
an Internet service, and the Internet service requires that 
CAPTCHAs be solved within 30 seconds. When using a 
CAPTCHA solver, the customer will have to pay to have 
at least 1,000 CAPTCHAs solved, and likely more due to 
solutions with response times longer than the 30-second 
threshold (recall that customers do not have to pay for incorrect 
solutions). From this perspective, the solver with 
the best value may not be the one with the cheapest price. 

Figure 9: Load reported by (a) Antigate and (b) DeCaptcher as a function of time-of-day in one-hour increments. For comparison, 
we show the percentage of correct responses and rejected requests per hour, as well as the average response time per hour. 

Figure 8 explores the relationship among accuracy, response 
time, and price for this scenario. The x-axis is 
the time threshold T within which a CAPTCHA is useful 
to a customer. The y-axis is the adjusted price per bundle 
of 1,000 CAPTCHAs that are both solved correctly 
and solved within time T. Each curve corresponds to a 
solver. Each solver charges a price per CAPTCHA solved 
(Table 1), but not all solved CAPTCHAs will be useful to 
the customer. The adjusted price therefore includes the 
overhead of solving CAPTCHAs that take longer than T 
and are effectively useless. Consider an example where a 
customer wants to have 1,000 correct CAPTCHAs solved 
within 30 seconds, a solver charges $2/1,000 CAPTCHAs, 
and 70% of the solver’s CAPTCHA responses are correct 
and returned within 30 seconds. In this case, the 
customer will effectively pay an adjusted price of $2 × 
(1/0.70) = $2.86/1,000 useful CAPTCHAs. 
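
The adjusted-price calculation can be written out as a short sketch. The function names and the response representation (a list of `(correct, seconds)` pairs) are illustrative assumptions, as is treating a response that arrives exactly at the threshold as useful.

```python
from typing import Iterable, Tuple


def useful_fraction(responses: Iterable[Tuple[bool, float]], threshold: float) -> float:
    """Share of responses that are both correct and returned within `threshold` seconds."""
    responses = list(responses)
    useful = sum(1 for correct, seconds in responses if correct and seconds <= threshold)
    return useful / len(responses)


def adjusted_price(bulk_price: float, fraction_useful: float) -> float:
    """Price per 1,000 useful CAPTCHAs, given the price per 1,000 solved."""
    if not 0 < fraction_useful <= 1:
        raise ValueError("fraction_useful must be in (0, 1]")
    return bulk_price / fraction_useful


# The worked example from the text: $2/1,000 with 70% useful responses.
print(f"${adjusted_price(2.00, 0.70):.2f} per 1,000 useful CAPTCHAs")  # $2.86 per 1,000 useful CAPTCHAs
```

Evaluating `adjusted_price` over a range of thresholds, one service at a time, would trace out one price curve per solver.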

The results in Figure 8 show that the solver with the 
best value depends on the response time threshold. For 
high thresholds (more than 25 seconds), both Antigate 
and CaptchaBot provide the best value and ImageToText 
is the most expensive as suggested by their bulk prices 
(Table 1). However, below this threshold the rankings be- 
gin to change. Antigate begins to have better value than 
CaptchaBot due to having consistently better response 
times. In addition, ImageToText starts to overtake the 
other services. Even though its bulk price is 5x that of 
DeCaptcher, for instance, its service is a better value for 
having CAPTCHAs solved within 8 seconds (albeit at a 
premium adjusted price). 


5.8 Capacity 


Another point of differentiation is solver capacity, 
namely how many CAPTCHAs a service can solve in a 
given unit of time. In addition to low-rate measurements, 
we also attempted to measure a service’s maximum capacity 
using bursts of CAPTCHA requests. Specifically, 
we measured the number and rate of solutions returned 
in response to a given offered load, substantially increasing 
the load in increments until the service appeared 
overloaded. We carried out this experiment successfully 
for five of the services. Of them, Antigate had by far 
the highest capacity, solving on the order of 27 to 41 
CAPTCHAs per second. Even at our highest sustained offered 
load (1,536 threads submitting CAPTCHAs simultaneously, 
bid set at $3/1,000), our rejection rate was very 
low, suggesting that Antigate’s actual capacity may in 
fact be higher. Due to financial considerations, we did 
not attempt higher offered loads. 

For the remaining services, we exceeded their avail- 
able capacity. We took a non-negligible reject rate to 
be an indicator of the service running at full capacity. 
Both DeCaptcher and CaptchaBot were able to sustain a 
rate of about 14-15 CAPTCHAs per second, with BeatCaptchas 
and BypassCaptcha sustaining a solve rate of 
eight and four CAPTCHAs per second, respectively. 

Based on these rates, we can calculate a rough esti- 
mate of the number of workers at these services. Assum- 
ing 10-13 seconds per CAPTCHA (based on our inter- 
view with MR. E, and consistent with our measured la- 
tencies of his service in the 10—20 second range), Anti- 
gate would have had at least 400-500 workers avail- 
able to service our request. Since we did not exceed 
their available capacity, the actual number may be larger. 
Both DeCaptcher and CaptchaBot, at a solve rate of 15 
CAPTCHAs per second mentioned above, would have had 
130-200 workers available. 
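
A rough version of this estimate follows from multiplying the sustained solve rate by the time each worker spends per CAPTCHA (the standard Little's law relation, which we assume underlies the figures above). The sketch uses the peak Antigate rate and the 15-per-second rate quoted in the text; the function name is illustrative.

```python
def estimate_workers(solves_per_second: float, seconds_per_captcha: float) -> float:
    """Concurrent workers needed to sustain a solve rate (Little's law: L = lambda * W)."""
    return solves_per_second * seconds_per_captcha


for name, rate in [("Antigate (peak rate)", 41), ("DeCaptcher / CaptchaBot", 15)]:
    low = estimate_workers(rate, 10)   # 10 seconds per CAPTCHA
    high = estimate_workers(rate, 13)  # 13 seconds per CAPTCHA
    print(f"{name}: {low:.0f}-{high:.0f} workers")
```

This yields 410-533 workers for Antigate and 150-195 for the other two services, the same order of magnitude as the rounded ranges in the text. The rounding used there is not stated.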


5.9 Load and Availability 


Customers can poll the transient load on the services and 
offer payment over the market rate in exchange for higher 
priority access when load is high. During our background 
CAPTCHA data collection for these services, we also 
recorded the transient load that they reported. From these 
measurements, we can examine to what extent services 
report substantial load, and correlate reported load with 
other observable metrics (response time, reject rate) to 
evaluate the validity of the load reports. Because De- 
Captcher charges the full customer bid independent of 
actual load, for instance, it might be motivated to report 
a false high load in an attempt to encourage higher bids 
from customers. 

Figure 9 shows the average reported load as a function 
of the time of day (in the US Pacific time zone) for both 
services: for each hour, we compute the average of all 
load samples taken during that hour for all days of our 
data set. Antigate reports a higher nominal background 
load than DeCaptcher, but both services clearly report a 
pronounced diurnal load effect. 
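
The per-hour averaging used for this kind of plot can be sketched as follows. The sample format (a timestamp paired with a reported load value) and the function name are assumptions for illustration, and timestamps are assumed to be already expressed in the desired local time zone.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def hourly_average_load(samples: Iterable[Tuple[datetime, float]]) -> Dict[int, float]:
    """Average reported load per hour of day (0-23), pooled over all days."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, load in samples:
        totals[timestamp.hour] += load
        counts[timestamp.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}


# Hypothetical samples: two readings in the 17:00 hour on different days, one at 03:00.
samples = [
    (datetime(2009, 10, 6, 17, 5), 80.0),
    (datetime(2009, 10, 7, 17, 40), 60.0),
    (datetime(2009, 10, 6, 3, 0), 10.0),
]
print(hourly_average_load(samples))  # {3: 10.0, 17: 70.0}
```

The same grouping applies to the other per-hour metrics (response time, reject percentage, percentage correct) by substituting the sampled value.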

For comparison, we also overlay three other service 
metrics for each hour across all days: average response 
time of solved CAPTCHAs, percentage of submitted 
CAPTCHAs rejected by the service, and the percentage 
of responses with correct solutions. Response time 
correlates with reported load, increasing by 5 seconds 
during high load for each service—suggesting that the 
high load reports are indeed valid. The percentage of re- 
jected requests for DeCaptcher further validates the load 
reports. When our bids to DeCaptcher were at the base 
price of $2/1,000 at times of high load, DeCaptcher ag- 
gressively rejected our work requests. To confirm that a 
higher bid resulted in lower rejection rates, we measured 
available capacity at 5PM (US Pacific time) at the base 
price of $2 and then, a few minutes later, at $5, obtaining 
solve rates of 8 and 18 CAPTCHAs per second, respec- 
tively. Although not conclusive, this experience suggests 
that higher bids may be necessary to achieve a desired 
level of service at times of high load. Likewise, Antigate 
exhibits better quality of service when bidding $1 over 
the base price, though bidding over this amount produced 
no noticeable improvement (we tested up to $6/1,000). 

As further evidence, recall that for Antigate we had to 
offer premium bids before the service would solve our requests 
(Section 5.3). As a result, even during high loads 
Antigate did not reject our requests, presumably priori- 
tizing our requests over others with lower bids. 

Finally, as expected, accuracy is independent of load: 
workers are shielded from load behind work queues, 
solving CAPTCHAs to their ability unaffected by the of- 
fered load on the system. 


6 Workforce 


Human CAPTCHA solving services are effectively aggregators. 
On one hand, they aggregate demand by providing 
a single point for purchasing solving services. At 
the same time, they aggregate the labor supply by providing 
a single point through which workers can depend 
on being offered consistent CAPTCHA solving work for 
hire. Thus, for each of the publicly-facing retail sites described 
previously, there is typically also a private “job 
site” accessed by workers to receive CAPTCHA images 
and provide textual solutions. Identifying these job sites 
and which retail service they support is an investigative 
challenge. For this study, we focused our efforts on two 
services for which we feel confident about the mapping: 
Kolotibablo and PixProfit. Kolotibablo is a Russian-run 
job site that supplies solutions for the retail service Antigate 
(which, along with CaptchaBot, is the current price 
leader). 

Figure 10: Portion of a PixProfit worker interface displaying a 
Microsoft CAPTCHA. 


6.1 Account Creation 


For each job site, account creation is similar to the retail 
side, but due diligence remains minimal. As a form of 
quality control, some job sites will evaluate new workers 
using a corpus of “test” CAPTCHAs (whose solutions are 
known a priori) before they allow them to solve externally 
provided CAPTCHAs. For this reason, we discard the first 30 
CAPTCHAs provided by PixProfit, which we learned by 
experience correspond to test CAPTCHAs. 


6.2 Worker Interface 


Services provide workers with a Web-based interface 
that, after logging in, displays CAPTCHAs to be solved 
and provides a text box for entering the solution (Figure 10 
shows an example of the interface for PixProfit). 
Each site also tracks the number of CAPTCHAs solved, 
the number that were reported as correct (by customers 
of the retail service), and the amount of money earned. 
PixProfit also assigns each worker a “priority” based 
on solution accuracy. Better accuracy results in more 
CAPTCHAs to solve during times of lower load. If a 
solver’s accuracy decreases too much, services ban the 
account. In our experiments, our worker agents always 
used fresh accounts with the highest level of priority. 

Language          Example             AG     BC     BY     CB     DC     IT    Avg
English           one two three      51.1   37.6   4.76   40.6   39.0   62.0   39.2
Chinese (Simp.)   一 二 三            48.4   31.0   0.00   68.9   26.9   35.8   35.2
Chinese (Trad.)   一 二 三            52.9   24.4   0.00   63.8   30.2   33.0   34.1
Spanish           uno dos tres       1.81   13.8   0.00   2.90   7.78   56.8   13.9
Italian           uno due tre        3.65   8.45   0.00   4.65   5.44   57.1   13.2
Tagalog           isa dalawa tatlo   0.00   5.79   0.00   0.00   7.84   57.2   11.8
Portuguese        um dois três       3.15   10.1   0.00   1.48   3.98   48.9   11.3
Russian           один два три       24.1   0.00   0.00   11.4   0.55   16.5   8.76
Tamil             [native script]    2.96   21.1   3.26   0.74   12.1   5.36   7.47
Dutch             een twee drie      4.09   1.36   0.00   0.00   1.22   31.1   6.30
Hindi             [native script]    10.5   5.38   2.47   1.52   6.30   9.49   5.94
German            eins zwei drei     3.62   0.72   0.00   1.46   0.58   29.1   5.91
Malay             satu dua tiga      0.00   1.42   0.00   0.00   0.55   29.4   5.23
Vietnamese        một hai ba         0.46   2.07   0.00   0.00   1.74   18.1   3.72
Korean            [native script]    0.00   0.00   0.00   0.00   0.00   20.2   3.37
Greek             ένα δύο τρία       0.45   0.00   0.00   0.00   0.00   15.5   2.65
Arabic            [native script]    0.00   0.00   0.00   0.00   0.00   15.3   2.56
Bengali           [native script]    0.45   0.00   9.89   0.00   0.00   0.00   1.72
Kannada           [native script]    0.91   0.00   0.00   0.00   0.55   6.14   1.26
Klingon           [native script]    0.00   0.00   0.00   0.00   0.00   1.12   0.19
Farsi             [native script]    0.45   0.00   0.00   0.00   0.00   0.00   0.08

Table 2: Percentage of responses from the services with correct answers for the language CAPTCHAs. 


6.3 Worker Wages 


Kolotibablo pays workers at a variable rate depending on 
how many CAPTCHAs they have solved. This rate varies 
from $0.50/1,000 up to over $0.75/1,000 CAPTCHAs. 
PixProfit is the equivalent supplier for DeCaptcher and 
offers a somewhat higher rate of $1/1,000. Typically, 
workers must earn a minimum amount of money be- 
fore payout ($3.00 at PixProfit and $1.00 at Kolotibablo), 
and services commonly provide payment via an online e- 
currency system such as WebMoney. 

While we cannot directly measure the gross wages 
paid by either service, Kolotibablo provides a public list 
to its workers detailing the monthly earnings for the top 
100 solvers each day (presumably as a worker incentive). 
We monitored these earnings for two months beginning 
on Dec. 1st, 2009. On this date, the average monthly 
payout among the top 100 workers was $106.31. However, 
during December, Kolotibablo revised its bonus payout 
system, which reduced the payout range by approximately 
50% (again reflecting downward price pressure on 
CAPTCHA-solving labor). As a result, one month later on 
Jan. 1st, 2010, the average monthly payout to the top 100 
earners decreased to $47.32. In general, these earnings are 
roughly consistent with wages paid to low-income textile 
workers in Asia [12], suggesting that CAPTCHA-solving is 
being outsourced to similar labor pools; we investigate this 
question next. 


6.4 Geolocating Workers 


We crafted CAPTCHAs whose solutions would reveal 
information about the geographic demographics of the 
CAPTCHA solvers. We created CAPTCHAs using words 
corresponding to digits in the native script of various 
languages (“uno”, “dos”, “tres”, etc., for the CAPTCHA 
challenge in Spanish), where the correct solution is the 
sequence of digits corresponding to those words (“1”, “2”, 
“3”, etc.) for any CAPTCHA in any language. Ideally, such 
CAPTCHAs should be easy to grasp and fast to solve by the 
language’s speakers, yet substantially less likely to be 
solved by non-speakers or random chance. We expect a 
measurably high accuracy for services employing workers 
familiar with those languages. 

Table 2 lists the languages we used in this experiment 
along with an example three-digit CAPTCHA in the language 
corresponding to the solution “123”. For broad global 
coverage, we selected 21 languages based on a combination 
of factors including global exposure (English), prevalence 
of world-wide native speakers (Chinese, Spanish, English, 
Hindi, Arabic), regions of expected low-cost labor markets 
with inexpensive Internet access (India, China, Southeast 
Asia, Latin America), developed regions unlikely to be 
sources of affordable CAPTCHA labor (e.g., Western Europe), 
and lastly one synthetic language as a control (Klingon [15]). 

The CAPTCHA we submitted had instructions in the 
language for how to solve the CAPTCHA (e.g., “Por favor 
escriba los números abajo” for Spanish), as well as an 
initial word and its digit as a concrete example 
(“uno”, “1”). In our experiments, we randomly generated 
222 unique CAPTCHAs in each language and submitted 
them to the six services still operating in January 2010. 
We rotated through languages such that we submitted a 
CAPTCHA in this format once every 20 to 25 minutes. The 
CAPTCHAs did not repeat digits, to reduce the correlated 
effect of a random guess. As a result, the actual probability 
of guessing a CAPTCHA is 1/504 (9 × 8 × 7, since the 
example reduces the available digits by one), although 
workers unaware of the construction would still be making 
guesses out of 1,000 possibilities. 
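
The guessing odds quoted above can be checked with a short calculation. This is only an illustrative sketch: the function name and its parameters are ours, and it assumes, as the text suggests, that nine digits remain once the example digit is excluded and that no digit repeats within a solution.

```python
from fractions import Fraction
from math import perm


def guess_probability(available_digits: int = 9, length: int = 3) -> Fraction:
    """Chance that one uniformly random guess matches a solution made of
    `length` distinct digits drawn from `available_digits` candidates."""
    # Ordered selections without repetition: 9 * 8 * 7 for the defaults.
    return Fraction(1, perm(available_digits, length))


# A worker who knows the construction guesses among 504 candidates.
informed = guess_probability()
# A worker unaware of it guesses among all 10**3 three-digit strings.
naive = Fraction(1, 10 ** 3)
print(informed, naive)
```

Either way, the chance of a correct answer by luck is well under one percent, so non-trivial accuracies in Table 2 are unlikely to come from guessing alone.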

Table 2 also shows the accuracy of the services when 
presented with these CAPTCHAs. The accuracy corre- 
sponds to a response with all three digits correct (since 
we created them we have their ground truth). For a con- 
venient ordering, we sort the languages by the average 
accuracy across all services. 

The results paint a revealing picture. First, although 
Roman alphanumerics in typical CAPTCHAs are glob- 
ally comprehensible—and therefore easily outsourced— 
English words for numerals represent a noticeable se- 
mantic gap for presumably non-English speakers. Very 
high accuracies on normal CAPTCHAs drop to 38-62% 
for the challenge presented in English. 

Second, workers at a number of the services exhibit 
strong affinities to particular languages. Five of the 
services have accuracies for Chinese (Traditional and 
Simplified) either substantially higher or nearly as high as 
English. The services evidently include a sizeable workforce 
fluent in Chinese, likely in mainland China with available 
low-cost labor. In addition, Antigate has appreciable 
accuracies for Russian and Hindi, presumably drawing on 
workforces in Russia and India. Similarly for 
CaptchaBypass and Russian; BeatCaptcha and Tamil, 
Portuguese, and Spanish; and DeCaptcher and Tamil. 
Other non-trivial accuracies in Bengali and Tagalog suggest 
further recruitment in India and southeast Asia. Services 
with non-trivial accuracies in Portuguese, Spanish, 
and Italian could be explained by a workforce familiar 
with one language who can readily deduce similar words 
in the other Romance languages. Consistent with these 
observations, MR. E reported in our interview that they 
draw from labor markets in China, India, Bangladesh, 
and Vietnam. 

Figure 11: Custom Asirra CAPTCHA: workers must type the 
letters corresponding to pictures of cats. 

Finally, the results for ImageToText are impressive. 
Relative to the other services, ImageToText has appreciable 
accuracy across a remarkable range of languages, 
including languages where the other services had few if 
any correct solutions (Dutch, Korean, Vietnamese, Greek, 
Arabic) and even two correct solutions of CAPTCHAs in 
Klingon. Either ImageToText recruits a truly international 
workforce, or the workers were able to identify the 
CAPTCHA construction and learn the correct answers. 
ImageToText is the most expensive service by a wide 
margin, but clearly has a dynamic and adaptive labor pool. 


Time Zone. As another approach for using CAPTCHAs 
to reveal demographic information about workers (in 
this case, their time zone), we translated the following 
instruction into 14 of the languages as CAPTCHA images: 
“Enter the current time”. We sent these CAPTCHAs 
to each of the six services at the same rate as the other 
language CAPTCHAs with numbers. We received 15,775 
responses, with the most common response being a retype 
of the instruction in the native language. Of the remaining 
responses, we received 1,583 (10.0%) with an answer in a 
recognizable time format. Of those, 77.9% came from 
UTC+8, further reinforcing the estimation of a large labor 
pool from China; the two other top time zones were the 
Indian UTC+5.5 with 5.7% and Eastern Europe UTC+2 
with 3.0%. 
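
The text does not spell out how a typed time was mapped to a time zone. One plausible approach, sketched below purely as an assumption, compares the wall-clock time a worker typed with the UTC time at which the response arrived and rounds the difference to the nearest half hour. The function name `infer_utc_offset`, the 24-hour "HH:MM" input format, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timezone


def infer_utc_offset(received_utc: datetime, reported_hhmm: str) -> float:
    """Estimate a worker's UTC offset in hours from a typed local time.

    Assumes a 24-hour "HH:MM" answer and a clock that is roughly correct.
    """
    hours, minutes = map(int, reported_hhmm.split(":"))
    reported = hours * 60 + minutes
    actual = received_utc.hour * 60 + received_utc.minute
    # Difference in minutes, wrapped into one day.
    diff = (reported - actual) % (24 * 60)
    # Real offsets are multiples of 30 minutes in almost all zones.
    diff = round(diff / 30) * 30
    # Map into the range of offsets in use (about UTC-10 to UTC+14).
    if diff > 14 * 60:
        diff -= 24 * 60
    return diff / 60


received = datetime(2010, 1, 15, 4, 10, tzinfo=timezone.utc)
print(infer_utc_offset(received, "12:12"))  # consistent with UTC+8
print(infer_utc_offset(received, "09:41"))  # consistent with UTC+5.5
```

A real pipeline would also have to handle 12-hour answers and free-form formats, which this sketch ignores.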


6.5 Adaptability 


As a final assessment, we wanted to examine how both 
CAPTCHA services and solvers adapt to changes in 
state-of-the-art CAPTCHA generation. We focused on the 
recently proposed Asirra CAPTCHA [9], which is based 
on identifying pictures of cats and dogs among a set of 
12 images. Using the corpus of images provided by the 
Asirra authors, we handcrafted our own version of the 
CAPTCHA suitable for use with standard solver image 
APIs. Figure 11 shows an example. We wrote the 
instructions “Find all cats” in English, Chinese (Simpl.), 
Russian and Hindi across the top, as the majority of 
the workers speak one of these languages. We submitted 
this image once every three minutes to all services over 
12 days. ImageToText displayed a remarkable adaptability 
to this new CAPTCHA type, successfully solving the 
CAPTCHA on average 39.9% of the time. Figure 12 shows 
the declining error rate for ImageToText; as time 
progresses, the workers become increasingly adept 
at solving the CAPTCHA. The next closest service was 
BeatCaptchas, which succeeded 20.4% of the time. The 
remaining services, excluding DeCaptcher, had success 
rates below 7%. 

Figure 12: ImageToText error rate for the custom Asirra 
CAPTCHA over time. 

Kolotibablo (Antigate) 
Service             # CAPTCHAs   % Total   % Cum. 
Microsoft                6,552     25.5%    25.5% 
Vkontakte.ru             5,908     23.0%    48.5% 
Mail.ru                  3,607     14.0%    62.5% 
Captcha.ru               2,476      9.6%    72.2% 
reCaptcha                  921      3.6%    75.8% 
Other (18 sites)         3,680     14.3%    90.1% 
Unknown                  2,551      9.9%     100% 
Total                   25,695 

PixProfit (DeCaptcher) 
Service             # CAPTCHAs   % Total   % Cum. 
Microsoft               12,135     43.1%    43.1% 
reCaptcha               10,788     38.3%    81.4% 
Google                   1,202      4.3%    85.7% 
Yahoo                    1,035      3.7%    89.3% 
AOL                        415      1.5%    90.8% 
Other (18 sites)         1,086      3.9%    94.7% 
Unknown                  1,505      5.3%     100% 
Total                   28,166 

Table 3: The top 5 targeted CAPTCHA types on Kolotibablo and PixProfit, based on CAPTCHAs observed posing as workers. 

Coincidentally, as we were evaluating our own version 
of the Asirra CAPTCHA, on January 17th, 2010 DeCaptcher 
began offering an API method that supported it directly, 
albeit at $4 per 1,000 Asirra solves (double its base 
price). Microsoft had deployed the Asirra CAPTCHA 
on December 8th, 2009 on Club Bing. Demand for solving 
this CAPTCHA was apparently strong enough that 
DeCaptcher took only five weeks to incorporate it into 
their service. We then performed the same experiment 
described above using the new DeCaptcher API method 
and received 1,494 responses. DeCaptcher successfully 
solved 696 (46.5%) requests with a median response time 
of 39 seconds, about 2.3 times its median of 17 seconds 
for regular CAPTCHAs. DeCaptcher appears to have 
factored in the longer solve times for the Asirra 
CAPTCHAs into the charged price. From what we can tell, 
though, DeCaptcher does not pay PixProfit workers double 
the amount for solving them, consequently increasing 
its profit margin on these new CAPTCHAs. 


6.6 Targeted Sites 


Customers of CAPTCHA-solving services target a number 
of different Web sites. Using our worker accounts 
on Kolotibablo and PixProfit, the public worker sites 
of Antigate and DeCaptcher, respectively, we can identify 
which Web sites are targeted by the customers of 
these services. Over the course of 82 days we recorded 
over 25,000 CAPTCHAs from Kolotibablo and 28,000 
CAPTCHAs from PixProfit. 

To identify the Web sites from which these CAPTCHAs 
originated, we first grouped the CAPTCHAs by image 
dimensions. Most groups consisted of a single CAPTCHA 
type, which we confirmed visually. We then attempted to 
identify the Web sites from which these CAPTCHAs were 
taken. In this manner we identified 90% of Kolotibablo 
CAPTCHAs and 94% of PixProfit CAPTCHAs. 
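
The first step above, grouping by image dimensions, can be sketched as follows. This is a minimal illustration rather than the tooling actually used: the function name, the input shape (tuples of image identifier, width, height), and the sample dimensions are all hypothetical.

```python
from collections import defaultdict


def group_by_dimensions(images):
    """Bucket CAPTCHA images by (width, height), largest bucket first.

    `images` is an iterable of (image_id, width, height) tuples.
    """
    groups = defaultdict(list)
    for image_id, width, height in images:
        groups[(width, height)].append(image_id)
    # Large buckets are the best candidates for a single CAPTCHA type.
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


sample = [("a", 218, 48), ("b", 300, 57), ("c", 218, 48)]
for dims, ids in group_by_dimensions(sample):
    print(dims, len(ids))
```

Each bucket would still need the visual confirmation described above, since two sites can happen to serve images of the same size.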

Table 3 shows the top five CAPTCHA types we observed 
on Kolotibablo and PixProfit, with the remaining 
identified CAPTCHA types (18 in both cases) 
representing 14% and 4% of the CAPTCHA volume on 
Kolotibablo and PixProfit respectively. Both distributions 
of CAPTCHA types are highly skewed: on PixProfit, 
the top two CAPTCHA types represent 81% of the volume, 
with the top five accounting for 91%. Kolotibablo 
is not quite as concentrated, but the top five still account 
for 76% of its volume. 

Clearly the markets for the services are different. 
Although Microsoft is by far the most common target for 
both, PixProfit caters to CAPTCHAs from large global 
services (Google, Yahoo, AOL, and MySpace) whereas 
Russian sites otherwise dominate Kolotibablo 
(VKontakte.ru, Mail.ru, CAPTCHA.ru, Mamba.ru, and 
Yandex), a demographic that correlates well with the 
observed worker fluency in Russian for Antigate (Table 2). 


7 Discussion and Conclusion 


By design, CAPTCHAs are simple and easy to solve by 
humans. Their “low-impact” quality makes them attractive 
to site operators who are wary of any defense that 
could turn away visitors. However, this same quality has 
made them easy to outsource to the global unskilled labor 
market. In this study, we have shed light on the 
business of solving CAPTCHAs, showing it to be a 
well-developed, highly-competitive industry with the 
capacity to solve on the order of a million CAPTCHAs per 
day. Wholesale and retail prices continue to decline, 
suggesting that this is a demand-limited market; an 
assertion further supported by our informal survey of 
several freelancer forums where workers in search of 
CAPTCHA-solving work greatly outnumber CAPTCHA-solving 
service recruitments. One may well ask: Do CAPTCHAs 
actually work? The answer depends on what it is that we 
expect CAPTCHAs to do. 

Telling computers and humans apart. The original 
purpose of CAPTCHAs is to distinguish humans from 
machines. To this day, no completely general means of 
solving CAPTCHAs has emerged, nor is the cat-and-mouse 
game of creating automated solvers viable as a business 
model. In this regard, then, CAPTCHAs have succeeded. 

Preventing automated site access. Today, the retail 
price for solving one million CAPTCHAs is as 
low as $1,000. Indeed, for well-motivated adversaries, 
CAPTCHAs are an acceptable cost of doing business 
when measured against the value of gaining access to the 
protected resource. E-mail spammers, for example, solve 
CAPTCHAs to gain access to Web mail accounts from 
which to send their advertisements, while blog spammers 
seek to acquire organic “clicks” and influence result 
placement on major search engines. Thus, in an absolute 
sense, CAPTCHAs do not prevent large-scale automated 
site access. 

Limiting automated site access. However, it is 
short-sighted to evaluate CAPTCHAs as a defense in 
isolation. Rather, they exert friction on the underlying 
economic model and should be evaluated in terms of how 
efficiently they can undermine the attacker’s profitability. 
Put simply, a CAPTCHA reduces an attacker’s expected 
profit by the cost of solving the CAPTCHA. If the 
attacker’s revenue cannot cover this cost, CAPTCHAs as 
a defense mechanism have succeeded. Indeed, for many 
sites (e.g., low PageRank blogs), CAPTCHAs alone may 
be sufficient to dissuade abuse. For higher-value sites, 
CAPTCHAs place a utilization constraint on otherwise 
“free” resources, below which it makes no sense to target 
them. Taking e-mail spam as an example, let us suppose 
that each newly registered Web mail account can send 
some number of spam messages before being shut down. 
The marginal revenue per message is given by the average 
revenue per sale divided by the expected number of 
messages needed to generate a single sale. For 
pharmaceutical spam, Kanich et al. [14] estimate the 
marginal revenue per message to be roughly $0.00001; at 
$1 per 1,000 CAPTCHAs, a new Web mail account starts to 
break even only after about 100 messages sent.⁹ 
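
The break-even figure follows directly from the two quantities quoted above. The sketch below reproduces the arithmetic; the function name is ours, and it assumes one CAPTCHA is solved per account and that no other costs are incurred.

```python
from fractions import Fraction
from math import ceil


def break_even_messages(captcha_cost, revenue_per_message) -> int:
    """Messages an account must send before revenue covers the CAPTCHA cost.

    Both arguments are dollar amounts given as decimal strings, so the
    division is exact rather than subject to floating-point error.
    """
    return ceil(Fraction(captcha_cost) / Fraction(revenue_per_message))


# $1 per 1,000 CAPTCHAs is $0.001 per account; revenue is $0.00001 per message.
print(break_even_messages("0.001", "0.00001"))
```

Halving the CAPTCHA price halves the break-even point, which is why falling solving prices push site operators toward earlier use of secondary defenses.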

Thus, CAPTCHAs naturally limit site access to those 
attackers whose business models are efficient enough to 
be profitable in spite of these costs and act as a drag on 
profit for all actors. Indeed, MR. E reported that while 
his service had thousands of customers, 75% of traffic 
was generated by a small subset of them (5-10). 

The role of CAPTCHAs today. Continuing our reasoning, 
the profitability of any particular scam is a function 
of three factors: the cost of CAPTCHA-solving, the 
effectiveness of any secondary defenses (e.g., SMS 
validation, account shutdowns, additional CAPTCHA 
screens, etc.) and the efficiency of the attacker’s business 
model. As the cost of CAPTCHA solving decreases, a site 
operator must employ secondary defenses more aggressively 
to maintain a given level of fraud. 

Unfortunately, secondary defenses are invariably more 
expensive both in infrastructure and customer impact 
when compared to CAPTCHAs. However, a key observation 
is that secondary defenses need only be deployed 
quickly enough to undermine profitability (e.g., within a 
certain number of messages sent, accounts registered per 
IP, etc.). Indeed, the optimal point for this transition is 
precisely the point at which the attacker “breaks even.” 
Before this point it is preferable to use CAPTCHAs to 
minimize the cost burden to the site owner and the 
potential impact on legitimate users. While we do not 
believe that such economic models have been carefully 
developed by site owners, we see evidence that precisely 
this kind of tradeoff is being made. For example, a number 
of popular sites such as Google are now making aggressive 
use of secondary mechanisms to screen account sign-ups 
(e.g., SMS challenges), but only after a CAPTCHA is 
passed and some usage threshold is triggered (e.g., 
multiple sign-ups from the same IP address).¹⁰ 

In summary, we have argued that CAPTCHAs, while 
traditionally viewed as a technological impediment to 
an attacker, should more properly be regarded as an 
economic one, as witnessed by a robust and mature 
CAPTCHA-solving industry which bypasses the underlying 
technological issue completely. Viewed in this light, 
CAPTCHAs are a low-impact mechanism that adds friction 
to the attacker’s business model and thus minimizes 
the cost and legitimate user impact of heavier-weight 
secondary defenses. CAPTCHAs continue to serve this 
function, but as with most such defensive mechanisms, they 
simply work less efficiently over time. 


⁹ These numbers should be taken with a grain of salt, both 
because the cited study is but a single data point, and because they 
studied SMTP-based spam, which generally has lower deliverability than 
Webmail-based spam. Anecdotally, the retail cost of Webmail-based 
delivery can be over 100 times more than via SMTP from raw bots. 

¹⁰ Anecdotally, this strategy appears effective for now and Gmail 
accounts on the underground market have gone from a typical asking 
price of $8/1,000, to being hard to come by at any price. We will not 
be surprised, however, if this mechanism leads to the monetization of 
smartphone botnets, or mobots [10], in response. 


Abstract 


Oppressive regimes and even democratic governments 
restrict Internet access. Existing anti-censorship systems 
often require users to connect through proxies, but these 
systems are relatively easy for a censor to discover and 
block. This paper offers a possible next step in the cen- 
sorship arms race: rather than relying on a single system 
or set of proxies to circumvent censorship firewalls, we 
explore whether the vast deployment of sites that host 
user-generated content can breach these firewalls. To ex- 
plore this possibility, we have developed Collage, which 
allows users to exchange messages through hidden chan- 
nels in sites that host user-generated content. Collage has 
two components: a message vector layer for embedding 
content in cover traffic; and a rendezvous mechanism 
to allow parties to publish and retrieve messages in the 
cover traffic. Collage uses user-generated content (e.g., 
photo-sharing sites) as “drop sites” for hidden messages. 
To send a message, a user embeds it into cover traffic and 
posts the content on some site, where receivers retrieve 
this content using a sequence of tasks. Collage makes it 
difficult for a censor to monitor or block these messages 
by exploiting the sheer number of sites where users can 
exchange messages and the variety of ways that a mes- 
sage can be hidden. Our evaluation of Collage shows 
that the performance overhead is acceptable for sending 
small messages (e.g., Web articles, email). We show how 
Collage can be used to build two applications: a direct 
messaging application, and a Web content delivery sys- 
tem. 


1 Introduction 


Network communication is subject to censorship and 
surveillance in many countries. An increasing number 
of countries and organizations are blocking access to 
parts of the Internet. The Open Net Initiative reports 
that 59 countries perform some degree of filtering [36]. 
For example, Pakistan recently blocked YouTube [47]. 
Content deemed offensive by the government has been 
blocked in Turkey [48]. The Chinese government regularly 
blocks activist websites [37], even as China has 
become the country with the most Internet users [19]; 
more recently, China has filtered popular content sites 
such as Facebook and Twitter, and even requires users 
to register to visit certain sites [43]. Even democratic 
countries such as the United Kingdom and Australia 
have recently garnered attention with controversial 
filtering practices [35, 54, 55]; South Korea’s president 
recently considered monitoring Web traffic for political 
opposition [31]. 

Although existing anti-censorship systems (notably, 
onion routing systems such as Tor [18]) have allowed 
citizens some access to censored information, these 
systems require users outside the censored regime to set up 
infrastructure: typically, they must establish and maintain 
proxies of some kind. The requirement for running 
fixed infrastructure outside the firewall imposes two 
limitations: (1) a censor can discover and block the 
infrastructure; (2) benevolent users outside the firewall 
must install and maintain it. As a result, these systems are 
somewhat easy for censors to monitor and block. For 
example, Tor has recently been blocked in China [45]. 
Although these systems may continue to enjoy widespread 
use, this recent turn of events raises the question of 
whether there are fundamentally new approaches to 
advancing this arms race: specifically, we explore whether 
it is possible to circumvent censorship firewalls with 
infrastructure that is more pervasive, and that does not 
require individual users or organizations to maintain it. 


We begin with a simple observation: countless sites allow 
users to upload content to sites that they do not maintain 
or own through a variety of media, ranging from 
photos to blog comments to videos. Leveraging the large 
number of sites that allow users to upload their own 
content potentially yields many small cracks in censorship 
firewalls, because there are many different types of media 
that users can upload, and many different sites where 
they can upload it. The sheer number of sites that users 
could use to exchange messages, and the many different 
ways they could hide content, makes it difficult for a 
censor to successfully monitor and block all of them. 

In this paper, we design a system to circumvent 
censorship firewalls using different types of user-generated 
content as cover traffic. We present Collage, a method 
for building message channels through censorship 
firewalls using user-generated content as the cover medium. 
Collage uses existing sites to host user-generated content 
that serves as the cover for hidden messages (e.g., 
photo-sharing, microblogging, and video-sharing sites). 
First, hiding messages in photos, text, and video across a 
wide range of host sites makes it more difficult for censors 
to block all possible sources of censored content. Second, 
because the messages are hidden in other seemingly 
innocuous messages, Collage provides its users some level 
of deniability that they do not have in using existing 
systems (e.g., accessing a Tor relay node immediately 
implicates the user that contacted the relay). We can 
achieve these goals with minimal out-of-band 
communication. 

Collage is not the first system to suggest using covert 
channels: much previous work has explored how to build 
a covert channel that uses images, text, or some other 
media as cover traffic, sometimes in combination with 
mix networks or proxies [3, 8, 17, 18, 21, 38, 41]. Other 
work has also explored how these schemes might be 
broken [27], and others hold the view that message hiding 
or “steganography” can never be fully secure. Collage’s 
new contribution, then, is to design covert channels based 
on user-generated content and imperfect message-hiding 
techniques in a way that circumvents censorship firewalls 
and is robust enough to allow users to freely exchange 
messages, even in the face of an adversary that may be 
looking for such suspicious cover traffic. 

The first challenge in designing Collage is to develop 
an appropriate message vector for embedding messages 
in user-generated content. Our goal for developing a 
message vector is to find user-generated traffic (e.g., 
photos, blog comments) that can act as a cover medium, is 
widespread enough to make it difficult for censors to 
completely block and remove, yet is common enough 
to provide users some level of deniability when they 
download the cover traffic. In this paper, we build 
message vectors using the user-generated photo-sharing 
site, Flickr [24], and the microblogging service, Twitter 
[49], although our system in no way depends on these 
particular services. We acknowledge that one or both of 
these specific sites may ultimately be blocked in certain 
countries; indeed, we witnessed that parts of Flickr were 
already blocked in China when accessed via a Chinese 
proxy in January 2010. A main strength of Collage’s 
design is that blocking a specific site or set of sites will not 
fully stem the flow of information through the firewall, 
since users can use so many sites to post user-generated 
content. We have chosen Flickr and Twitter as a proof of 
concept, but Collage users can easily use domestic 
equivalents of these sites to communicate using Collage. 

Figure 1: Collage’s interaction with the network. See 
Figure 2 for more detail. 

Given that there are necessarily many places where 
one user might hide a message for another, the second 
challenge is to agree on rendezvous sites where a sender 
can leave a message for a receiver to retrieve. We aim 
to build this message layer in a way that the client’s traf- 
fic looks innocuous, while still preventing the client from 
having to retrieve an unreasonable amount of unwanted 
content simply to recover the censored content. The ba- 
sic idea behind rendezvous is to embed message seg- 
ments in enough cover material so that it is difficult for 
the censor to block all segments, even if it joins the sys- 
tem as a user; and users can retrieve censored messages 
without introducing significant deviations in their traffic 
patterns. In Collage, senders and receivers agree on a 
common set of network locations where any given con- 
tent should be hidden; these agreements are established 
and communicated as “tasks” that a user must perform 
to retrieve the content (e.g., fetching a particular URL, 
searching for content with a particular keyword). Fig- 
ure 1 summarizes this process. Users send a message 
with three steps: (1) divide a message into many erasure- 
encoded “blocks” that correspond to a task, (2) embed 
these blocks into user-generated content (e.g., images), 
and (3) publish this content at user-generated content 
sites, which serve as rendezvous points between senders 
and receivers. Receivers then retrieve a subset of these 
blocks to recover the original message by performing one 
of these tasks. 
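
To make step (1) concrete, the sketch below splits a message into blocks such that a receiver can recover it from a subset of them. It is not Collage's actual coding scheme, which this section does not specify: as a stand-in it uses the simplest possible erasure code, k data blocks plus one XOR parity block, which tolerates the loss of any single block. The function names and the 4-byte length prefix are assumptions made for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(message: bytes, k: int) -> list[tuple[int, bytes]]:
    """Split `message` into k data blocks plus one XOR parity block."""
    # A length prefix lets the decoder strip the zero padding.
    payload = len(message).to_bytes(4, "big") + message
    size = -(-len(payload) // k)  # ceiling division
    payload = payload.ljust(size * k, b"\0")
    blocks = [payload[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, blocks)
    # Index k marks the parity block.
    return list(enumerate(blocks + [parity]))


def decode(received: list[tuple[int, bytes]], k: int) -> bytes:
    """Recover the message from any k of the k + 1 encoded blocks."""
    got = dict(received)
    missing = [i for i in range(k) if i not in got]
    if len(missing) > 1 or (missing and k not in got):
        raise ValueError("too many blocks lost")
    if missing:
        # XOR of the k surviving blocks reproduces the lost data block.
        others = [got[i] for i in range(k + 1) if i != missing[0]]
        got[missing[0]] = reduce(xor_bytes, others)
    payload = b"".join(got[i] for i in range(k))
    length = int.from_bytes(payload[:4], "big")
    return payload[4:4 + length]


blocks = encode(b"censored article text", k=4)
survivors = [b for b in blocks if b[0] != 2]  # one block was removed
print(decode(survivors, k=4))
```

Each block would then be embedded in a separate piece of cover content (step 2) and published (step 3). A practical system would use a code that tolerates many more lost blocks than this single-parity example.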

This paper presents the following contributions. 


• We present the design and implementation of Collage, 
a censorship-resistant message channel built 
using user-generated content as the cover medium. 
An implementation of the Collage message channel 
is publicly available [13]. 

• We evaluate the performance and security of Collage. 
Collage does impose some overhead, but the 
overhead is acceptable for small messages (e.g., 
articles, emails, short messages), and Collage’s 
overhead can also be reduced at the cost of making the 
system less robust to blocking. 

• We present Collage’s general message-layer 
abstraction and show how this layer can serve as the 
foundation for two different applications: Web 
publishing and direct messaging (e.g., email). We 
describe and evaluate these two applications. 


The rest of this paper proceeds as follows. Section 2 
presents related work. In Section 3, we describe the de- 
sign goals for Collage and the capabilities of the cen- 
sor. Section 4 presents the design and implementation 
of Collage. Section 5 evaluates the performance of Col- 
lage’s messaging layer and applications. Section 6 de- 
scribes the design and implementation of two applica- 
tions that are built on top of this messaging layer. Sec- 
tion 7 discusses some limitations of Collage’s design and 
how Collage might be extended to cope with increasingly 
sophisticated censors. Section 8 concludes. 


2 Background and Related Work 


We survey other systems that provide anonymous, con- 
fidential, or censorship-resistant communication. We 
note that most of these systems require setting up a 
dedicated infrastructure of some sort, typically based 
on proxies. Collage departs significantly from this ap- 
proach, since it leverages existing infrastructure. At the 
end of this section, we discuss some of the challenges 
in building covert communications channels using exist- 
ing techniques, which have also been noted in previous 
work [15]. 


Anonymization proxies. Conventional anti-censorship 
systems have typically consisted of simple Web proxies. 
For example, Anonymizer [3] is a proxy-based system 
that allows users to connect to an anonymizing proxy 
that sits outside a censoring firewall; user traffic to and 
from the proxy is encrypted. These types of systems pro- 
vide confidentiality, but typically do not satisfy any of 
the other design goals: for example, the existence of any 
encrypted traffic might be reason for suspicion (thus vi- 
olating deniability), and a censor that controls a censor- 
ing firewall can easily block or disrupt communication 
once the proxy is discovered (thus violating resilience). 
A censor might also be able to use techniques such as 
SSL fingerprinting or timing attacks to link senders and 
receivers, even if the underlying traffic is encrypted [29]. 
Infranet attempts to create deniability for clients by em- 
bedding censored HTTP requests and content in HTTP 
traffic that is statistically indistinguishable from “innocu- 
ous” HTTP traffic [21]. Infranet improves deniability, 
but it still depends on cooperating proxies outside the 
firewall that might be discovered and blocked by censors.
Collage improves availability by leveraging the
large number of user-generated content sites, as opposed 
to a relatively smaller number of proxies. 


One of the difficult problems with anti-censorship 
proxies is that a censor could also discover these prox- 
ies and block access to them. Feamster et al. pro- 
posed a proxy-discovery method based on frequency 
hopping [22]. Kaleidoscope is a peer-to-peer overlay 
network to provide users robust, highly available access 
to these proxies [42]. This system is complementary to 
Collage, as it focuses more on achieving availability, at 
the expense of deniability. Collage focuses more on pro- 
viding users deniability and preventing the censor from 
locating all hosts from where censored content might be 
retrieved. 


Anonymous publishing and messaging systems. 
CovertFS [5] is a file system that hides data in photos 
using steganography. Although the work briefly men- 
tions challenges in deniability and availability, it is easily 
defeated by many of the attacks discussed in Section 7. 
Furthermore, CovertFS could in fact be implemented us- 
ing Collage, thereby providing the design and security 
benefits described in this paper. 


Other existing systems allow publishers and clients 
to exchange content using either peer-to-peer networks 
(Freenet [12]) or using a storage system that makes 
it difficult for an attacker to censor content without 
also removing legitimate content from the system (Tan- 
gler [53]). Freenet provides anonymity and unlinkabil- 
ity, but does not provide deniability for users of the sys- 
tem, nor does it provide any inherent mechanisms for re- 
silience: an attacker can observe the messages being ex- 
changed and disrupt them in transit. Tangler’s concept of 
document entanglement could be applied to Collage to 
prevent the censor from discovering which images con- 
tain embedded information. 


Anonymizing mix networks. Mix networks (e.g., 
Tor [18], Tarzan [25], Mixminion [17]) offer a network 
of machines through which users can send traffic if they 
wish to communicate anonymously with one another. 
Danezis and Dias present a comprehensive survey of 
these networks [16]. These systems also attempt to pro- 
vide unlinkability; however, previous work has shown 
that, depending on its location, a censor or observer 
might be able to link sender and receiver [4, 6, 23, 33, 39, 
40]. These systems also do not provide deniability for 
users, and typically focus on anonymous point-to-point 
communication. In contrast, Collage provides a deniable 
means for asynchronous point-to-point communication. 
Finally, mix networks like Tor traditionally use a public
relay list, which is easily blocked, although work has
been done to try to rectify this [44, 45].


Message hiding and embedding techniques. Collage 
relies on techniques that can embed content into cover 
traffic. The current implementation of Collage uses an 
image steganography tool called outguess [38] for 
hiding content in images and a text steganography tool 
called SNOW [41] for embedding content in text. We 
recognize that steganography techniques offer no for- 
mal security guarantees; in fact, these schemes can and 
have been subject to various attacks (e.g., [27]). Danezis 
has also noted the difficulty in building covert channels 
with steganography alone [15]: not only can the algo- 
rithms be broken, but also they do not hide the identi- 
ties of the communicating parties. Thus, these functions 
must be used as components in a larger system, not as 
standalone “solutions”. Collage relies on the embedding 
functions of these respective algorithms, but its security 
properties do not hinge solely on the security properties 
of any single information hiding technique; in fact, Col- 
lage could have used watermarking techniques instead, 
but we chose these particular embedding techniques for 
our proof of concept because they had readily available, 
working implementations. One of the challenges that 
Collage’s design addresses is how to use imperfect mes- 
sage hiding techniques to build a message channel that is 
both available and offers some amount of deniability for 
users. 


3 Problem Overview 


We now discuss our model for the censor’s capabilities 
and our goals for circumventing a censor who has these 
capabilities. It is difficult, if not impossible, to fully de- 
termine the censor’s current or potential capabilities; as a 
result, Collage cannot provide formal guarantees regard- 
ing success or deniability. Instead, we present a model 
for the censor that we believe is more advanced than cur- 
rent capabilities and, hence, where Collage is likely to 
succeed. Nevertheless, censorship is an arms race, so as 
the censor’s capabilities evolve, attacks against censor- 
ship firewalls will also need to evolve in response. In 
Section 7, we discuss how Collage could be extended
to deal with these more advanced capabilities as the cen- 
sor becomes more sophisticated. 

We note that although we focus on censors, Collage 
also depends on content hosts to store media containing 
censored content. Content hosts currently do not appear 
to be averse to this usage (e.g., to the best of our knowledge,
Collage does not violate the Terms of Service for
either Flickr or Twitter), although if Collage were to be- 
come very popular this attitude would likely change. Al- 
though we would prefer content hosts to willingly serve 
Collage content (e.g., to help users in censored regimes),
Collage can use many content hosts to prevent any single 
host from compromising the entire system. 


3.1 The Censor 


We assume that the censor wishes to allow some Internet 
access to clients, but can monitor, analyze, block, and al- 
ter subsets of this traffic. We believe this assumption is 
reasonable: if the censor builds an entirely separate net- 
work that is partitioned from the Internet, there is little 
we can do. Beyond this basic assumption, there is a wide 
range of capabilities we can assume. Perhaps the most 
difficult aspect of modeling the censor is figuring out 
how much effort it will devote to capturing, storing, and 
analyzing network traffic. Our model assumes that the 
censor can deploy monitors at multiple network egress 
points and observe all traffic as it passes (including both 
content and headers). We consider two types of capabil- 
ities: targeting and disruption. 


Targeting. A censor might target a particular user be- 
hind the firewall by focusing on that user’s traffic pat- 
terns; it might also target a particular suspected content 
host site by monitoring changes in access patterns to that 
site (or content on that site). In most networks, a cen- 
sor can monitor all traffic that passes between its clients 
and the Internet. Specifically, we assume the censor can 
eavesdrop on any network traffic between clients on its network
and the Internet. A censor’s motive in passively
monitoring traffic would most likely be either to deter- 
mine that a client was using Collage or to identify sites 
that are hosting content. To do so, the censor could moni- 
tor traffic aggregates (i.e., traffic flow statistics, like Net- 
Flow [34]) to determine changes in overall traffic pat- 
terns (e.g., to determine if some website or content has 
suddenly become more popular). The censor can also ob- 
serve traffic streams from individual users to determine 
if a particular user’s clickstream is suspicious, or other- 
wise deviates from what a real user would do. These 
capabilities lead to two important requirements for pre- 
serving deniability: traffic patterns generated by Collage 
should not skew overall distributions of traffic, and the 
traffic patterns generated by an individual Collage user 
must resemble the traffic generated by innocuous indi- 
viduals. 

To target users or sites, a censor might also use Col- 
lage as a sender or receiver. This assumption makes some 
design goals more challenging: a censor could, for exam- 
ple, inject bogus content into the system in an attempt to 
compromise message availability. It could also join Col- 
lage as a client to discover the locations of censored con- 
tent, so that it could either block content outright (thus 
attacking availability) or monitor users who download 
similar sets of content (thus attacking deniability). We 
also assume that the censor could act as a content publisher.
Finally, we assume that a censor might be able to
coerce a content host to shut down its site (an aggressive 
variant of actively blocking requests to a site). 


Disruption. A censor might attempt to disrupt commu- 
nications by actively mangling traffic. We assume the 
censor would not mangle uncensored content in any way 
that a user would notice. A censor could, however, inject 
additional traffic in an attempt to confuse Collage’s pro- 
cess for encoding or decoding censored content. We as- 
sume that it could also block traffic at granularities rang- 
ing from an entire site to content on specific sites. 


The costs of censorship. In accordance with Bellovin’s 
recent observations [7], we assume that the censor’s ca- 
pabilities, although technically limitless, will ultimately 
be constrained by cost and effort. In particular, we as- 
sume that the censor will not store traffic indefinitely, 
and we assume that the censor’s will or capability to an- 
alyze traffic prevents it from observing more complex 
statistical distributions on traffic (e.g., we assume that it 
cannot perform analysis based on joint distributions be- 
tween arbitrary pairs or groups of users). We also assume 
that the censor’s computational capabilities are limited: 
for example, performing deep packet inspection on ev- 
ery packet that traverses the network or running statisti- 
cal analysis against all traffic may be difficult or infea- 
sible, as would performing sophisticated timing attacks 
(e.g., examining inter-packet or inter-request timing for 
each client may be computationally infeasible or at least 
prohibitively inconvenient). As the censorship arms race 
continues, the censor may develop such capabilities. 


3.2 Circumventing the Censor 


Our goal is to allow users to send and receive messages
across a censorship firewall that would otherwise
be blocked; we want to enable users to communicate 
across the firewall by exchanging articles and short mes- 
sages (e.g., email messages and other short messages). In 
some cases, the sender may be behind the firewall (e.g., 
a user who wants to publish an article from within a cen- 
sored regime). In other cases, the receiver might be be- 
hind the firewall (e.g., a user who wants to browse a cen- 
sored website). 

We aim to understand Collage’s performance in real 
applications and demonstrate that it is “good enough” to 
be used in situations where users have no other means 
for circumventing the firewall. We therefore accept that 
our approach may impose substantial overhead, and we 
do not aim for Collage’s performance to be comparable 
to that of conventional networked communication. Ulti- 
mately, we strive for a system that is effective and easy 
to use for a variety of networked applications. To this 
end, Collage offers a messaging library that can support
these applications; Section 6 describes two example ap- 
plications. 

Collage’s main performance requirement is that the 
overhead should be small enough to allow content to be 
stored on sites that host user-generated content and to al- 
low users to retrieve the hidden content in a reasonable 
amount of time (to ensure that the system is usable), and 
with a modest amount of traffic overhead (since some 
users may be on connections with limited bandwidth). 
In Section 5, we evaluate Collage’s storage requirements 
on content hosting sites, the traffic overhead of each mes- 
sage (as well as the tradeoff between this overhead and 
robustness and deniability), and the overall transfer time 
for messages. 


In addition to performance requirements, we want 
Collage to be robust in the face of the censor that we 
have outlined in Section 3.1. We can characterize this ro- 
bustness in terms of two more general requirements. The 
first requirement is availability, which says that clients 
should be able to communicate in the face of a censor 
that is willing to restrict access to various content and 
services. Most existing censorship circumvention sys- 
tems do not prevent a censor from blocking access to 
the system altogether. Indeed, regimes such as China 
have blocked or hijacked applications ranging from web- 
sites [43] to peer-to-peer systems [46] to Tor itself [45]. 
We aim to satisfy availability in the face of the censor’s 
targeting capabilities that we described in Section 3.1. 


Second, Collage should offer users of the system some 
level of deniability; although this design goal is hard to 
quantify or formalize, informally, deniability says that 
the censor cannot discover the users of the censorship 
system. It is important for two reasons. First, if the 
censor can identify the traffic associated with an anti- 
censorship system, it can discover and either block or 
hijack that traffic. As mentioned above, a censor observ- 
ing encrypted traffic may still be able to detect and block 
systems such as Tor [18]. Second, and perhaps more im- 
portantly, if the censor can identify specific users of a 
system, it can coerce those users in various ways. Past 
events have suggested that censors are able and willing 
to both discover and block traffic or sites associated with 
these systems and to directly target and punish users who 
attempt to defeat censorship. In particular, China re- 
quires users to register with ISPs before purchasing Inter- 
net access at either home or work, to help facilitate track- 
ing individual users [10]. Freedom House reports that in 
six of fifteen countries they assessed, a blogger or online 
journalist was sentenced to prison for attempting to cir- 
cumvent censorship laws—prosecutions have occurred 
in Tunisia, Iran, Egypt, Malaysia, and India [26]—and 
cites the case of a Chinese blogger who was recently
attacked [11]. As these regimes have indicated their
willingness and ability to monitor and coerce individual
users, we believe that attempting to achieve some level of
deniability is important for any anti-censorship system.

Figure 2: Collage’s layered design model. Operations
are in ovals; intermediate data forms are in rectangles.

By design, a user cannot disprove claims that he en- 
gages in deniable communication, thus making it easier 
for governments and organizations to implicate arbitrary 
users. We accept this as a potential downside of deniable 
communications, but point out that organizations can al- 
ready implicate users with little evidence (e.g., [2]). 


4 Collage Design and Implementation 


Collage’s design has three layers and roughly mimics the 
layered design of the network protocol stack itself. Fig- 
ure 2 shows these three layers: the vector, message, and 
application layers. The vector layer provides storage for 
short data chunks (Section 4.1), and the message layer 
specifies a protocol for using the vector layer to send 
and receive messages (Section 4.2). A variety of appli- 
cations can be constructed on top of the message layer. 
We now describe the vector and message layers in de- 
tail, deferring discussion of specific applications to Sec- 
tion 6. After describing each of these layers, we discuss 
rendezvous, the process by which senders and receivers 
find each other to send messages using the message layer 
(Section 4.3). Finally, we discuss our implementation 
and initial deployment (Section 4.4). 


4.1 Vector Layer 


The vector layer provides a substrate for storing short 
data chunks. Effectively, this layer defines the “cover 
media” that should be used for embedding a message. 
For example, if a small message is hidden in the high 
frequency of a video then the vector would be, for ex- 
ample, a YouTube video. This layer hides the details of 
this choice from higher layers and exposes three oper- 
ations: encode, decode, and isEncoded. These op- 
erations encode data into a vector, decode data from an 
encoded vector, and check for the presence of encoded 
data given a secret key, respectively. 
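
The three operations can be illustrated with a toy text vector. The sketch below is not Collage's implementation (which relies on outguess and SNOW); it hides bytes as trailing spaces and tabs on a comment string, loosely in the spirit of whitespace steganography. The function names `encode`, `decode`, and `is_encoded`, the 4-byte HMAC tag, and the sample cover text are all assumptions made for illustration, and the scheme offers no real concealment.

```python
import hashlib
import hmac

TAG_LEN = 4  # truncated HMAC, used only to recognise payloads made with our key


def _tag(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()[:TAG_LEN]


def _extract(vector: str) -> bytes:
    """Read the trailing run of spaces/tabs back into bytes (space=0, tab=1)."""
    hidden = vector[len(vector.rstrip(" \t")):]
    bits = "".join("1" if ch == "\t" else "0" for ch in hidden)
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def encode(cover: str, data: bytes, key: bytes) -> str:
    """Append tag+data to the cover text as invisible trailing whitespace."""
    payload = _tag(key, data) + data
    bits = "".join(f"{byte:08b}" for byte in payload)
    return cover.rstrip() + "".join("\t" if b == "1" else " " for b in bits)


def is_encoded(vector: str, key: bytes) -> bool:
    """Check, using the secret key, whether the vector carries our data."""
    raw = _extract(vector)
    return len(raw) >= TAG_LEN and hmac.compare_digest(
        raw[:TAG_LEN], _tag(key, raw[TAG_LEN:]))


def decode(vector: str, key: bytes) -> bytes:
    """Recover the hidden bytes, or fail if the key does not match."""
    if not is_encoded(vector, key):
        raise ValueError("vector holds no data for this key")
    return _extract(vector)[TAG_LEN:]
```

For example, `decode(encode("Nice photo!", b"chunk", b"id"), b"id")` yields `b"chunk"`, while the encoded comment still displays as "Nice photo!". Higher layers only need these three calls, which is what lets the cover medium be swapped without touching the message layer.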

Collage imposes requirements on the choice of vector.
First, each vector must have some capacity to hold
encoded data. Second, the population of vectors must
be large so that many vectors can carry many messages.
Third, to satisfy both availability and deniability, it must
be relatively easy for users to deniably send and receive 
vectors containing encoded chunks. Fourth, to satisfy 
availability, it must be expensive for the censor to disrupt 
chunks encoded in a vector. Any vector layer with these 
properties will work with Collage’s design, although the 
deniability of a particular application will also depend 
upon its choice of vector, as we discuss in Section 7. 

The feasibility of the vector layer rests on a key obser- 
vation: data hidden in user-generated content serves as a 
good vector for many applications, since it is both populous
and comes from a wide variety of sources (i.e., many
users). Examples of such content include images pub- 
lished on Flickr [24] (as of June 2009, Flickr had about 
3.6 billion images, with about 6 million new images per 
day [28]), tweets on Twitter [49] (Twitter had about half 
a million tweets per day [52], and Mashable projected 
about 18 million Twitter users by the end of 2009 [50]), 
and videos on YouTube [56], which had about 200,000
new videos per day as of March 2008 [57]. 

For concreteness, we examine two classes of vector 
encoding algorithms. The first option is steganography, 
which attempts to hide data in a cover medium such that 
only intended recipients of the data (e.g., those possess- 
ing a key) can detect its presence. Steganographic tech- 
niques can embed data in a variety of cover media, such 
as images, video, music, and text. Steganography makes 
it easy for legitimate Collage users to find vectors con- 
taining data and difficult for a censor to identify (and 
block) encoded vectors. Although the deniability that 
steganography can offer is appealing, key distribution is 
challenging, and almost all production steganography al- 
gorithms have been broken. Therefore, we cannot simply 
rely on the security properties of steganography. 

Another option for embedding messages is digital watermarking,
which is similar to steganography, except
that instead of hiding data from the censor, watermarking 
makes it difficult to remove the data without destroying 
the cover material. Data embedded using watermarking 
is perhaps a better choice for the vector layer: although 
encoded messages are clearly visible, they are difficult to 
remove without destroying or blocking a large amount of 
legitimate content. If watermarked content is stored in a 
lot of popular user-generated content, Collage users can 
gain some level of deniability simply because all popular 
content contains some message chunks. 

We have implemented two example vector layers. The
first is image steganography applied to images hosted on
Flickr [24]. The second is text steganography applied to
user-generated text comments on websites such as blogs,
YouTube [56], Facebook [20], and Twitter [49]. Despite
possible and known limitations to these approaches
(e.g., [27]), both of these techniques have working
implementations with running code [38, 41]. As watermarking
and other data-hiding techniques continue to become
more robust to attack, and as new techniques and
implementations emerge, Collage’s layered model can
incorporate those mechanisms. The goal of this paper is
not to design better data-hiding techniques, but rather to
build a censorship-resistant message channel that leverages
these techniques.


send(identifier, data)

 1  Create a rateless erasure encoder for data.
 2  for each suitable vector (e.g., image file)
 3  do
 4      Retrieve blocks from the erasure coder to
        meet the vector’s encoding capacity.
 5      Concatenate and encrypt these blocks using
        the identifier as the encryption key.
 6      encode the ciphertext into the vector.
 7      Publish the vector on a user-generated
        content host such that receivers
        can find it. See Section 4.3.


receive(identifier)

 1  Create a rateless erasure decoder.
 2  while the decoder cannot decode the message
 3  do
 4      Find and fetch a vector from a
        user-generated content host.
 5      Check if the vector contains encoded
        data for this identifier.
 6      if the vector is encoded with message data
 7      then
 8          decode payload from the vector.
 9          Decrypt the payload.
10          Split the plaintext into blocks.
11          Provide each decrypted block to
            the erasure decoder.
12  return decoded message from erasure decoder

Figure 3: The message layer’s send and receive operations.


4.2 Message Layer 


The message layer specifies a protocol for using the vec- 
tor layer to send and receive arbitrarily long messages 
(i.e., exceeding the capacity of a single vector). Observ- 
able behavior generated by the message layer should be 
deniable with respect to the normal behavior of the user 
or users at large. 

Figure 3 shows the send and receive operations. 
send encodes message data in vectors and publishes 
them on content hosts, while receive finds encoded vectors
on content hosts and decodes them to recover the
original message. The sender associates a message iden- 
tifier with each message, which should be unique for an 
application (e.g., the hash of the message). Receivers 
use this identifier to locate the message. For encoding 
schemes that require a key (e.g., [38]), we choose the 
key to be the message identifier. 


To distribute message data among several vectors, 
the protocol uses rateless erasure coding [9, 32], which 
generates a near-infinite supply of short chunks from a 
source message such that any appropriately-sized subset
of those chunks can reassemble the original message.
For example, a rateless erasure coder could take an
80 KB message and generate 1 KB chunks such that any
100-subset of those chunks recovers the original message.
Step 1 of send initializes a rateless erasure encoder
for generating chunks of the message; step 4 retrieves 
chunks from the encoder. Likewise, step 1 of receive 
creates a rateless erasure decoder, step 11 provides re- 
trieved chunks to the decoder, and step 12 recovers the 
message. 
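
To make the idea of rateless erasure coding concrete, the following sketch implements a random linear fountain code over GF(2): each chunk is the XOR of a pseudorandom subset of source blocks, and the decoder recovers the message by Gaussian elimination once it holds enough independent chunks. This is a simpler stand-in for the codes cited above, not the coder Collage uses; the names `split_blocks`, `make_chunk`, and `Decoder`, and the use of a per-chunk integer seed, are assumptions made for illustration.

```python
import random


def coefficient_mask(seed: int, k: int) -> int:
    """Derive a non-empty subset of the k source blocks (as a bit mask) from a seed."""
    rng = random.Random(seed)
    mask = 0
    while mask == 0:
        mask = rng.getrandbits(k)
    return mask


def split_blocks(message: bytes, block_size: int) -> list:
    """Zero-pad the message and split it into equal-sized blocks stored as integers."""
    if not message:
        raise ValueError("message must be non-empty")
    padded = message + b"\x00" * (-len(message) % block_size)
    return [int.from_bytes(padded[i:i + block_size], "big")
            for i in range(0, len(padded), block_size)]


def make_chunk(blocks: list, seed: int) -> tuple:
    """Produce one chunk: the XOR of the blocks selected by the seed's mask."""
    mask = coefficient_mask(seed, len(blocks))
    payload = 0
    for index, block in enumerate(blocks):
        if (mask >> index) & 1:
            payload ^= block
    return seed, payload


class Decoder:
    """Incremental Gaussian elimination over GF(2), kept in reduced row echelon form."""

    def __init__(self, k: int, block_size: int, length: int):
        self.k, self.block_size, self.length = k, block_size, length
        self.rows = {}  # pivot bit -> (mask, payload)

    def add(self, seed: int, payload: int) -> None:
        mask = coefficient_mask(seed, self.k)
        # Cancel every already-solved pivot out of the incoming chunk.
        for bit, (row_mask, row_payload) in self.rows.items():
            if (mask >> bit) & 1:
                mask ^= row_mask
                payload ^= row_payload
        if mask == 0:
            return  # linearly dependent on chunks already seen
        pivot = mask.bit_length() - 1
        # Remove the new pivot from existing rows so each pivot appears in one row only.
        for bit, (row_mask, row_payload) in list(self.rows.items()):
            if (row_mask >> pivot) & 1:
                self.rows[bit] = (row_mask ^ mask, row_payload ^ payload)
        self.rows[pivot] = (mask, payload)

    def done(self) -> bool:
        return len(self.rows) == self.k

    def message(self) -> bytes:
        if not self.done():
            raise ValueError("not enough independent chunks yet")
        data = b"".join(self.rows[i][1].to_bytes(self.block_size, "big")
                        for i in range(self.k))
        return data[:self.length]
```

A receiver would call `add` for every chunk it manages to retrieve, in any order, and stop as soon as `done()` holds; which particular chunks were lost does not matter. With this construction the receiver typically needs only slightly more chunks than there are source blocks, although the exact count depends on the random masks.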


Most of the remaining send operations are straightforward,
involving encryption and concatenation (step 5),
and operation of the vector layer’s encode function
(step 6). Likewise, receive operates the vector layer’s
decode function (step 8), then decrypts and splits the
payload (steps 9 and 10). The only more complex operations
are step 7 of send and step 4 of receive, which publish 
and retrieve content from user-generated content hosts. 
These steps must ensure (1) that senders and receivers 
agree on locations of vectors and (2) that publishing and 
retrieving vectors is done in a deniable manner. We now 
describe how to meet these two requirements. 
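
Before turning to rendezvous, step 5 of send and steps 9 and 10 of receive can be sketched as follows. The paper does not name the cipher, so this example substitutes a toy SHA-256 counter-mode keystream keyed by the message identifier; the function names and the fixed block size are assumptions, and a real deployment would want a vetted authenticated cipher with a fresh nonce per vector, since reusing one keystream across vectors leaks information.

```python
import hashlib


def _keystream(identifier: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over identifier || counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(identifier + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def concatenate_and_encrypt(blocks: list, identifier: bytes) -> bytes:
    """send, step 5: join the erasure-coded blocks and encrypt them."""
    plaintext = b"".join(blocks)
    stream = _keystream(identifier, len(plaintext))
    return bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt_and_split(ciphertext: bytes, identifier: bytes, block_size: int) -> list:
    """receive, steps 9 and 10: decrypt the payload and split it into blocks."""
    stream = _keystream(identifier, len(ciphertext))
    plaintext = bytes(c ^ s for c, s in zip(ciphertext, stream))
    return [plaintext[i:i + block_size]
            for i in range(0, len(plaintext), block_size)]
```

Because every block has the same size, the receiver can split the plaintext without any framing metadata, which keeps the payload hidden in each vector as small as possible.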


4.3 Rendezvous: Matching Senders to 
Receivers 


Vectors containing message data are stored to and re- 
trieved from user-generated content hosts; to exchange 
messages, senders and receivers must first rendezvous. 
To do so, senders and receivers perform sequences of 
tasks, which are time-dependent sequences of actions. 
An example of a sender task is the sequence of HTTP 
requests (i.e., actions) and fetch times corresponding to 
“Upload photos tagged with ‘flowers’ to Flickr”; a corresponding
receiver task is “Search Flickr for photos
tagged with ‘flowers’ and download the first 50 images.” 
This scheme poses many challenges: (1) to achieve deni- 
ability, all tasks must resemble observable actions com- 
pleted by innocuous entities not using Collage (e.g., 
browsing the Web), (2) senders must identify vectors 
suitable for each task, and (3) senders and receivers must 
agree on which tasks to use for each message. This section
addresses these challenges.


Identifying suitable vectors. Task deniability depends 
on properly selecting vectors for each task. For example,
for the receiver task “search for photos with keyword
flowers,” the corresponding sender task (“publish
a photo with keyword flowers”) must be used with photos
of flowers; otherwise, the censor could easily identify
vectors containing Collage content as those vectors that 
do not match their keywords. To achieve this, the sender 
picks vectors with attributes (e.g., associated keywords) 
that match the expected content of the vector. 
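
As a minimal sketch of this selection step, assume the sender keeps a directory of photos annotated with keywords. The mapping from file path to tag set, and the function name `find_suitable_vectors`, are hypothetical; the point is only that a photo qualifies when its annotations cover every keyword the task will publish it under.

```python
def find_suitable_vectors(annotated_photos: dict, required_keywords: set) -> list:
    """Return paths of photos whose annotations include every required keyword."""
    required = {keyword.lower() for keyword in required_keywords}
    return sorted(
        path for path, tags in annotated_photos.items()
        if required <= {tag.lower() for tag in tags}
    )


# Hypothetical annotated photo directory.
photos = {
    "img/rose.jpg": {"flowers", "garden"},
    "img/tulip.jpg": {"Flowers"},
    "img/harbor.jpg": {"boats", "sea"},
}
flower_photos = find_suitable_vectors(photos, {"flowers"})
```

Here only the rose and tulip photos would be offered to a task that publishes under the keyword "flowers"; the harbor photo is excluded because publishing it under that tag would look anomalous.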


Agreeing on tasks for a message. Each user maintains 
a list of deniable tasks for common behaviors involv- 
ing vectors (Section 4.1) and uses this list to construct 
a task database. The database is simply a table of pairs
(T_s, T_r), where T_s is a sender task and T_r is a receiver
task. Senders and receivers construct pairs such that T_s
publishes vectors in locations visited by T_r. For example,
if T_r performs an image search for photos with keyword
“flowers”, then T_s would publish only photos with
that keyword (and actually depicting flowers). Given
this database, the sender and receiver map each message
identifier to one or more task pairs and execute T_s and
T_r, respectively.


The sender and receiver must agree on the mapping 
of identifiers to database entries; otherwise, the receiver 
will be unable to find vectors published by the sender. If 
the sender’s and receiver’s databases are identical, then 
the sender and receiver simply use the message identi- 
fier as an index into the task database. Unfortunately, 
the database may change over time, for a variety of rea- 
sons: tasks become obsolete (e.g., Flickr changes its page 
structure) and new tasks are added (e.g., it may be ad- 
vantageous to add a task for a new search keyword dur- 
ing a current event, such as an election). Each time the 
database changes, other users need to be made aware of 
these changes. To this end, Collage provides two oper- 
ations on the task database: add and remove. When a 
user receives an advertisement for a new task or a with- 
drawal of an existing task he uses these operations to up- 
date his copy of the task database. 


Learning task advertisements and withdrawals is ap- 
plication specific. For some applications, a central 
authority sends updates using Collage’s own message 
layer, while in others updates are sent offline (i.e., sep- 
arate from Collage). We discuss these options in Sec- 
tion 6. One feature is common to all applications: de- 
lays in propagation of database updates will cause dif- 
ferent users to have slightly different versions of the task 
database, necessitating a mapping for identifiers to tasks 
that is robust to slight changes to the database. 

Figure 4: The expected number of common tasks when
mapping the same message identifier to a task subset,
between two task databases that agree on varying
percentages of tasks.


To reconcile database disagreements, our algorithm 
for mapping message identifiers to task pairs uses con- 
sistent hash functions [30], which guarantee that small 
changes to the space of output values have minimal im- 
pact on the function mapping. We initialize the task 
database by choosing a pseudorandom hash function h 
(e.g., SHA-1) and precomputing h(t) for each task t. The 
algorithm for mapping an identifier M to an m-subset of
the database is simple: compute h(M) and take the m
entries from the task database with precomputed hash
values closest to h(M); these task pairs are the mapping
for M.
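
The mapping just described can be sketched in a few lines. This example assumes that tasks are identified by name strings, that h is SHA-1 interpreted as an integer, and that "closest" means smallest absolute numeric difference; the text does not pin down the distance measure, so that choice, like the function names, is an assumption.

```python
import hashlib


def h(value: str) -> int:
    """Hash a task name or message identifier to an integer (SHA-1 here)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


def build_database(task_names: list) -> list:
    """Precompute h(t) for each task t."""
    return [(h(name), name) for name in task_names]


def map_identifier(database: list, identifier: str, m: int) -> list:
    """Return the m tasks whose precomputed hashes lie closest to h(identifier)."""
    target = h(identifier)
    nearest = sorted(database, key=lambda entry: abs(entry[0] - target))[:m]
    return [name for _, name in nearest]
```

The useful property is locality: removing a task that was not among the m selected leaves the mapping untouched, and adding one new task can displace at most one of the m selected tasks. That is why two parties whose databases differ slightly still tend to agree on most tasks for a given identifier.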


Using consistent hashing to map identifiers to task 
pairs provides an important property: updating the 
database results in only small changes to the mappings 
for existing identifiers. Figure 4 shows the expected 
number of tasks reachable after removing a percentage 
of the task database and replacing it with new tasks. 
As expected, increasing the number of tasks mapped for 
each identifier decreases churn. Additionally, even if half 
of the database is replaced, the sender and receiver can 
agree on at least one task when three or more tasks are 
mapped to each identifier. In practice, we expect the dif- 
ference between two task databases to be around 10%, so 
three tasks to each identifier is sufficient. Thus, two par- 
ties with slightly different versions of the task database 
can still communicate messages: although some tasks 
performed by the receiver (i.e., mapped using his copy 
of the database) will not yield content, most tasks will. 
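
The effect of database churn can be explored with a small Monte Carlo experiment. The sketch below is not the procedure used to produce Figure 4 and makes no claim to reproduce its numbers; the database size, the task naming, the SHA-1 hash, and the absolute-difference distance are all assumptions chosen for illustration.

```python
import hashlib
import random


def h(value: str) -> int:
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


def closest(tasks: list, identifier: str, m: int) -> set:
    """The m tasks whose hashes are nearest to the identifier's hash."""
    target = h(identifier)
    return set(sorted(tasks, key=lambda task: abs(h(task) - target))[:m])


def mean_common_tasks(n_tasks: int, replaced_fraction: float, m: int,
                      trials: int = 200, seed: int = 0) -> float:
    """Estimate how many mapped tasks a sender and receiver share when a
    fraction of the receiver's database has been replaced by new tasks."""
    rng = random.Random(seed)
    sender = [f"task-{i}" for i in range(n_tasks)]
    n_replaced = round(n_tasks * replaced_fraction)
    total = 0
    for trial in range(trials):
        dropped = set(rng.sample(sender, n_replaced))
        receiver = [task for task in sender if task not in dropped]
        receiver += [f"new-{trial}-{i}" for i in range(n_replaced)]
        identifier = f"message-{trial}"
        total += len(closest(sender, identifier, m) & closest(receiver, identifier, m))
    return total / trials
```

Calling `mean_common_tasks(100, 0.1, 3)` estimates the overlap for the roughly 10% disagreement expected in practice. Identical databases always share all m tasks, and fully disjoint databases share none, so the interesting behavior lies between those extremes.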


Choosing deniable tasks. Tasks should mimic the nor- 
mal behavior of users, so that a user who is perform- 
ing these tasks is unlikely to be pinpointed as a Collage 
user (which, in and of itself, could be incriminating). We
design task sequences to “match” those of normal visi- 
tors to user-generated content sites. Tasks for different 
content hosts have different deniability criteria. For ex- 
ample, the task of looking at photos corresponding to a 
popular tag or tag pair offers some level of deniability, 
because an innocuous user might be looking at popular 
images anyway. The challenge, of course, is finding sets 
of tasks that are deniable, yet focused enough to allow a 
user to retrieve content in a reasonable amount of time. 
We discuss the issue of deniability further in Section 7. 


4.4 Implementation 


Collage requires minimal modification to existing infras- 
tructure, so it is small and self-contained, yet modu- 
lar enough to support many possible applications; this 
should facilitate adoption. We have released a version of 
Collage [13]. 

We have implemented Collage as a 650-line Python 
library, which handles the logic of the message layer, in- 
cluding the task database, vector encoding and decod- 
ing, and the erasure coding algorithm. To execute tasks, 
the library uses Selenium [1], a popular web browser au- 
tomation tool; Selenium visits web pages, fills out forms, 
clicks buttons and downloads vectors. Executing tasks 
using a real web browser frees us from implementing an 
HTTP client that produces realistic Web traffic (e.g., by 
loading external images and scripts, storing cookies, and 
executing asynchronous JavaScript requests). 

We represent tasks as Python functions. Table 1 shows 
four examples. Each application supplies definitions of 
the operations used by the tasks (e.g., 
FindPhotosOfFlickrUser). The task database is a list 
of tasks, sorted by their MD5 hash; to map an identifier 
to a set of tasks, the database finds the tasks with 
hashes closest to the hash of the message identifier. 
After mapping, receivers simply execute these tasks and 
decode the resulting vectors. Senders face a harder 
problem: they must supply each task with a vector 
suitable for that task. For instance, the task “publish 
a photo tagged with ‘flowers’” must be supplied with a 
photo of flowers. We delegate the job of finding vectors 
that meet specific requirements to a vector provider. 
The exact details differ between applications; one of 
our applications searches a directory of annotated 
photos, while another prompts the user to type a phrase 
containing certain words (e.g., “Olympics”). 
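
As an illustration of how a sender task and a vector provider might fit together, consider the sketch below. All names here (`Vector`, `DirectoryVectorProvider`, `make_publish_tagged_photo_task`, and the `upload` callback) are hypothetical and are not taken from the Collage library; the provider simply searches an in-memory list of annotated photos.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Vector:
    """A candidate cover object, e.g. a photo annotated with tags."""
    name: str
    tags: FrozenSet[str]


class DirectoryVectorProvider:
    """Hypothetical provider that searches a collection of annotated photos."""

    def __init__(self, vectors: Iterable[Vector]):
        self._vectors = list(vectors)

    def find(self, required_tags: Iterable[str]) -> Optional[Vector]:
        """Return the first vector carrying all required tags, if any."""
        needed = frozenset(required_tags)
        for vector in self._vectors:
            if needed <= vector.tags:
                return vector
        return None


def make_publish_tagged_photo_task(tag: str,
                                   upload: Callable[[Vector, bytes], None]):
    """Build a sender task: publish message data in a photo tagged with `tag`."""
    def task(provider: DirectoryVectorProvider, msg_data: bytes) -> bool:
        vector = provider.find([tag])
        if vector is None:
            return False  # no suitable vector; the sender must try another task
        upload(vector, msg_data)
        return True
    return task


uploads: List[tuple] = []
provider = DirectoryVectorProvider([
    Vector("beach.jpg", frozenset({"vacation", "beach"})),
    Vector("rose.jpg", frozenset({"flowers"})),
])
publish_flowers = make_publish_tagged_photo_task(
    "flowers", lambda vector, data: uploads.append((vector.name, data)))
published = publish_flowers(provider, b"block-0")
```

Keeping the provider separate from the task lets the same task definition work with different vector sources, such as a photo directory or a prompt to the user.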


5 Performance Evaluation 


This section evaluates Collage according to the three per- 
formance metrics introduced in Section 3: storage over- 
head on content hosts, network traffic, and transfer time. 
We characterize Collage’s performance by measuring its 
behavior in response to a variety of parameters. Recall 
that Collage (1) processes a message through an erasure 
coder, (2) encodes blocks inside vectors, (3) executes 
tasks to distribute the message vectors to content hosts, 
(4) retrieves some of these vectors from content hosts, 
and (5) decodes the message on the receiving side. Each 
stage can affect performance. In this section, we evalu- 
ate how each of these factors affects the performance of 
the message layer; Section 6 presents additional perfor- 
mance results for Collage applications using real content 
hosts. 


• Erasure coding can recover an n-block message 
from (1 + ε/2)n of its coded message blocks. Collage 
uses ε = 0.01, as recommended by [32], yielding an 
expected 0.5% increase in storage, traffic, and 
transfer time of a message. 


• Vector encoding stores erasure coded blocks inside 
vectors. Production steganography tools achieve 
encoding rates of between 0.01 and 0.05, which increases 
storage, traffic, and transfer time by a factor of 
between 20 and 100 [38]. Watermarking algorithms 
are less efficient; we hope that innovations in 
information hiding can reduce this overhead. 


• Sender and receiver tasks publish and retrieve 
vectors from content hosts. Tasks do not affect the 
storage requirement on content hosts, but each task 
can impose additional traffic and time. For exam- 
ple, a task that downloads images by searching for 
them on Flickr can incur hundreds of kilobytes of 
traffic before finding encoded vectors. Depending 
on network connectivity, this step could take any- 
where from a few seconds to a few minutes and can 
represent an overhead of several hundred percent, 
depending on the size of each vector. 


• The number of executed tasks differs between 
senders and receivers. The receiver performs as 
many tasks as necessary until it is able to decode 
the message; this number depends on the size of 
the message, the number of vectors published by 
the sender, disagreements between sender and re- 
ceiver task databases, the dynamics of the content 
host (e.g., a surge of Flickr uploads could “bury” 
Collage encoded vectors), and the number of tasks 
and vectors blocked by the censor. While testing 
Collage, we found that we needed to execute only 
one task for the majority of cases. 


The sender must perform as many tasks as necessary 
so that, given the many ways the receiver can 
fail to obtain vectors, the receiver will still be able 
to retrieve enough vectors to decode the message. 
In practice, this number is difficult to estimate and 
vectors are scarce, so the sender simply uploads as 
many vectors as possible. 


Content host | Sender task | Receiver task 
Flickr | PublishAsUser(‘User’, Photo, MsgData) | FindPhotosOfFlickrUser(‘User’) 
Twitter | PostTweet(‘Watching the Olympics’, MsgData) | SearchTwitter(‘Olympics’) 

Table 1: Examples of sender and receiver task snippets. 
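
A rough worked example of how the first two overheads combine may help. The sketch below assumes that storage on the content host is simply the erasure-coded message size divided by the vector encoding rate; the function name and this simplified model are illustrative assumptions, and task overhead is ignored.

```python
def storage_on_host_kb(message_kb: float,
                       erasure_overhead: float = 0.005,
                       encoding_rate: float = 0.05) -> float:
    """Estimate vector storage (KB) needed to carry a message.

    erasure_overhead: the expected 0.5% increase from erasure coding.
    encoding_rate: fraction of each vector that carries message data.
    """
    coded_kb = message_kb * (1.0 + erasure_overhead)
    return coded_kb / encoding_rate


# A 23 KB message at the two ends of the quoted encoding-rate range.
best_case = storage_on_host_kb(23, encoding_rate=0.05)   # about 462 KB
worst_case = storage_on_host_kb(23, encoding_rate=0.01)  # about 2.3 MB
```

Under this model the erasure coding overhead is negligible next to the cost of vector encoding, which is why improvements in encoding efficiency matter most.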


We implemented a Collage application that publishes 
vectors on a simulated content host, allowing us to ob- 
serve the effects of these parameters. Figure 5 shows the 
results of running several experiments across Collage’s 
parameter space. The simulation sends and receives a 
23 KB one-day news summary. The message is erasure 
coded with a block size of 8 bytes and encoded into several 
vectors randomly drawn from a pool of vectors with an 
average size of 200 KB. Changing the message size scales 
the metrics linearly, while increasing the block size only 
decreases erasure coding efficiency. 

Figure 5a demonstrates the effect of vector encoding 
efficiency on required storage on content hosts. We used 
a fixed-size identifier-to-task mapping of ten tasks. We 
chose four send rates, which are multiples of the mini- 
mum number of tasks required to decode the message: 
the sender may elect to send more vectors if he believes 
some vectors may be unreachable by the receiver. For 
example, with a send rate of 10x, the receiver can still 
retrieve the message even if 90% of vectors are unavail- 
able. Increasing the task mapping size may be necessary 
for large send rates, because sending more vectors re- 
quires executing more tasks. These results give us hope 
for the future of information hiding technology: current 
vector encoding schemes are around 5% efficient; according 
to Figure 5a, this is a region where a significant 
reduction in storage is possible with only incremental 
improvements in encoding techniques (i.e., the slope is 
steep). 
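
The relationship between send rate and tolerable loss can be written down directly. This is a small sketch under the assumption that every vector carries the same amount of message data and that any sufficiently large subset of vectors decodes the message; the function names are illustrative.

```python
import math


def tolerable_loss(send_rate: float) -> float:
    """Fraction of published vectors that may be unavailable.

    send_rate is a multiple of the minimum needed to decode the message.
    """
    if send_rate < 1:
        raise ValueError("send rate must be at least 1")
    return 1.0 - 1.0 / send_rate


def vectors_to_send(min_vectors: int, send_rate: float) -> int:
    """Number of vectors to publish for a given send rate."""
    return math.ceil(min_vectors * send_rate)


loss_at_10x = tolerable_loss(10)   # 0.9, i.e. 90% of vectors may be lost
loss_at_2x = tolerable_loss(2)     # 0.5
```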

Figure 5b predicts total sender and receiver traffic 
from task overhead traffic, assuming 1 MB of vector stor- 
age on the content host. As expected, blocking more vec- 
tors increases traffic, as the receiver must execute more 
tasks to receive the same message content. Increasing 
storage beyond 1 MB decreases receiver traffic, because 
more message vectors are available for the same block- 
ing rate. An application executed on a real content host 
transfers around 1 MB of overhead traffic for a 23 KB 
message. 

Finally, Figure 5c shows the overall transfer time for 
senders and receivers, given varying time overheads. 
These overheads are optional for both senders and re- 
ceivers and impose delays between requests to evade 
timing analysis by the censor. For example, Collage 
could build a distribution of inter-request timings from 
the user’s normal (i.e., non-Collage) traffic and impose 
this timing distribution on Collage tasks. We simulated 
the total transfer time using three network connection 
speeds. The first (768 Kbps download and 384 Kbps 
upload) is a typical entry-level broadband package and 
would be experienced if both senders and receivers are 
typical users within the censored domain. The second 
(768/10000 Kbps) would be expected if the sender has a 
high-speed connection, perhaps operating as a dedicated 
publisher outside the censored domain; one of the ap- 
plications in Section 6 follows this model. Finally, the 
6000/1000 Kbps connection represents expected next- 
generation network connectivity in countries experienc- 
ing censorship. In all cases, reasonable delays are im- 
posed upon transfers, given the expected use cases of 
Collage (e.g., fetching a daily news article). We confirmed 
this result: a 23 KB message stored on a real content host 
took under 5 minutes to receive over an unreliable broad- 
band wireless link; sender time was less than 1 minute. 


6 Building Applications with Collage 


Developers can build a variety of applications using the 
Collage message channel. In this section, we outline re- 
quirements for using Collage and present two example 
applications. 


6.1 Application Requirements 


Even though application developers use Collage as a se- 
cure, deniable messaging primitive, they must still re- 
main conscious of overall application security when us- 
ing these primitives. Additionally, the entire vector layer 
and several parts of the message layer presented in Sec- 
tion 4 must be provided by the application. These com- 
ponents can each affect correctness, performance, and 
security of the entire application. In this section, we dis- 
cuss each of these components. Table 2 summarizes the 
component choices. 


Vectors, tasks, and task databases. Applications spec- 
ify a class of vectors and a matching vector encoding al- 
gorithm (e.g., Flickr photos with image steganography) 
based on their security and performance characteristics. 
For example, an application requiring strong content de- 
niability for large messages could use a strong steganog- 
raphy algorithm to encode content inside of videos. 
Figure 5: Collage’s performance metrics, as measured using a simulated content host. 
[Plot legends: send rates of 1.5x, 2.0x, and 10.0x; 20.0%, 40.0%, 60.0%, and 80.0% blocked; connection speeds of 768/384, 768/10000, and 6000/1000 Kbps.] 


Component | Web Content Proxy (Sec. 6.2) | Covert Email (Sec. 6.3) | Other options 
Vectors | Photos | Text | Videos, music 
Vector encoding | Image steganography | Text steganography | Video steganography, digital watermarking 
Vector sources | Users of content hosts | Covert Email users | Automatic generation, crawl the Web 
Tasks | Upload/download Flickr photos | Post/receive Tweets | Other user-generated content host(s) 
Database distribution | Sent by publisher via proxy | Agreement by users | Prearranged algorithm, “sneakernet” 
Identifier security | Distributed by publisher, groups | Group key | Existing key distribution infrastructure 

Table 2: Summary of application components. 


Tasks are application-specific: uploading photos to 
Flickr is different from posting tweets on Twitter. Appli- 
cations insert tasks into the task database, and the mes- 
sage layer executes these tasks when sending and receiv- 
ing messages. The applications specify how many tasks 
are mapped to each identifier for database lookups. In 
Section 4.3, we showed that mapping each identifier to 
three tasks ensures that, on average, users can still com- 
municate even with slightly out-of-date databases; appli- 
cations can further boost availability by mapping more 
tasks to each identifier. 


Finally, applications must distribute the task database. 
In some instances, a central authority can send the 
database to application users via Collage itself. In other 
cases, the database is communicated offline. The appli- 
cation’s task database should be large enough to ensure 
diversity of tasks for messages published at any given 
time; if n messages are published every day, then the 
database should have cn tasks, where c is at least the 
size of the task mapping. Often, tasks can be generated 
programmatically, to reduce network overhead. For ex- 
ample, our Web proxy (discussed next) generates tasks 
from a list of popular Flickr tags. 


Sources of vectors. Applications must acquire vectors 
used to encode messages, either by requiring end-users 
to provide their own vectors (e.g., from a personal photo 
collection), automatically generating them, or obtaining 
them from an external source (e.g., a photo donation system). 


Identifier security. Senders and receivers of a message 
must agree on a message identifier for that message. This 
process is analogous to key distribution. There is a gen- 
eral tradeoff between ease of message identifier distri- 
bution and security of the identifier: if users can easily 
learn identifiers, then more users will use the system, but 
it will also be easier for the censor to obtain the identi- 
fier; the inverse is also true. Developers must choose a 
distribution scheme that meets the intended use of their 
application. We discuss two approaches in the next two 
sections, although there are certainly other possibilities. 


Application distribution and bootstrapping. Users ul- 
timately need a secure one-time mechanism for obtain- 
ing the application, without using Collage. A variety of 
distribution mechanisms are possible: clients could re- 
ceive software using spam or malware as a propagation 
vector, or via postal mail or person-to-person exchange. 
There will ultimately be many ways to distribute appli- 
cations without the knowledge of the censor. Other sys- 
tems face the same problem [21]. This requirement does 
not obviate Collage, since once the user has received the 
software, he or she can use it to exchange an arbitrary 
number of messages. 


To explore these design parameters in practice, we built 
two applications using Collage’s message layer. The first 
is a Web content proxy whose goal is to distribute content 
to many users; the second is a covert email system. 


Figure 6: Proxied Web content passes through multiple 
parties before publication on content hosts. Each group 
downloads a different subset of images when fetching the 
same URL. 


6.2 Web Content Proxy 


We have built an asynchronous Web proxy using Col- 
lage’s message layer, with which a publisher in an un- 
censored region makes content available to clients inside 
censored regimes. Unlike traditional proxies, our proxy 
shields both the identities of its users and the content 
hosted from the censor. 

The proxy serves small Web documents, such as arti- 
cles and blog posts, by steganographically encoding con- 
tent into images hosted on photo-sharing websites like 
Flickr and Picasa. A standard steganography tool [38] 
can encode a few kilobytes in a typical image, mean- 
ing most hosted documents will fit within a few images. 
To host many documents simultaneously, however, the 
publisher needs a large supply of images; to meet this 
demand, the publisher operates a service allowing gen- 
erous users of online image hosts to donate their im- 
ages. The service takes the images, encodes them with 
message data, and returns the encoded images to their 
owners, who then upload them to the appropriate image 
hosts. Proxy users download these photos and decode 
their contents. Figure 6 summarizes this process. Notice 
that the publisher is outside the censored domain, which 
frees us from worrying about sender deniability. 

To use a proxy, users must discover a publisher, register 
with that publisher, and be notified of an encryption 
key. Publishers are identified by their public keys, 
so discovering publishers reduces to a key distribution 
exercise, although these keys must be distributed without 
arousing the suspicion of the censor. Several techniques 
are feasible: the key could be delivered alongside the 
client software, derived from a standard SSL key pair, or 
distributed offline. Like any key-based security system, 
our proxy must deal with this inherent bootstrapping 
problem. 

Once the client knows the publisher’s public key, it 
sends a message requesting registration. The message 
identifier is the publisher’s public key and the message 
payload contains the public key of the client encrypted 
using the publisher’s public key. This encryption ensures 
that only the publisher knows the client’s public key. The 
publisher receives and decrypts the client’s registration 
request using his own private key. 

The client is now registered but doesn’t know where 
content is located. Therefore, the publisher sends the 
client a message containing a group key, encrypted using 
the client’s public key. The group key is shared among a 
small number of proxy users and is used to discover iden- 
tifiers of content. For security, different groups of users 
fetch content from different locations; this prevents any 
one user from learning about (and attacking) all content 
available through the proxy. 

After registration is complete, clients can retrieve content. 
To look up a URL u, a client hashes u with a keyed 
hash function using the group key. It uses the hash as the 
message identifier for receive. 
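
The URL lookup can be illustrated with a keyed hash from the Python standard library. The text does not name the keyed hash function, so the choice of HMAC-SHA256 below is an assumption, as are the function name and the example key.

```python
import hashlib
import hmac


def url_identifier(group_key: bytes, url: str) -> str:
    """Derive a message identifier for a URL from the group key.

    Assumption: HMAC-SHA256 stands in for the unspecified keyed hash.
    """
    return hmac.new(group_key, url.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical group key and URL, for illustration only.
identifier = url_identifier(b"example-group-key", "http://example.com/news")
```

Clients holding the same group key derive the same identifier for a URL, while a client with a different group key derives an unrelated one, which is what keeps groups from learning each other's content locations.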

Unlike traditional Web proxies, only a limited amount 
of content is available through our proxy. Therefore, 
to accommodate clients’ needs for unavailable content, 
clients can suggest content to be published. To suggest a 
URL, a client sends the publisher a message containing 
the requested URL. If the publisher follows the sugges- 
tion, then it publishes the URL for users of that client’s 
group key. 

Along with distributing content, the publisher pro- 
vides updates to the task database via the proxy itself 
(at the URL proxy://updates). The clients oc- 
casionally fetch content from this URL to keep syn- 
chronized with the publisher’s task database. The con- 
sistent hashing algorithm introduced in Section 4.3 al- 
lows updates to be relatively infrequent; by default, the 
proxy client updates its database when 20% of tasks have 
been remapped due to churn (i.e., there is a 20% reduc- 
tion in the number of successful task executions). Fig- 
ure 4 shows that there may be many changes to the task 
database before this occurs. 


Implementation and Evaluation. We have imple- 
mented a simple version of the proxy and can use it 
to publish and retrieve documents on Flickr. The task 
database is a set of tasks that search for combinations 
(e.g., “vacation” and “beach”) of the 130 most popular 
tags. A 23 KB one-day news summary requires nine 
JPEG photos (≈ 3 KB data per photo, plus encoding 
overhead) and takes approximately 1 minute to retrieve 
over a fast network connection; rendering web pages and 
large photos takes a significant fraction of this time. Note 
that the document is retrieved immediately after 
publication; performance decays slightly over time because 
search results are displayed in reverse chronological or- 
der. We have also implemented a photo donation service, 
which accepts Flickr photos from users, encodes them 
with censored content, and uploads them on the user’s 
behalf. This donation service is available for down- 
load [13]. 


6.3 Covert Email 


Although our Web proxy provides censored content to 
many users, it is susceptible to attack from the censor 
for precisely this reason: because no access control is 
performed, the censor could learn the locations of pub- 
lished URLs using the proxy itself and potentially mount 
denial-of-service attacks. To provide greater security and 
availability, we present Covert Email, a point-to-point 
messaging system built on Collage’s message layer that 
excludes the censor from sending or receiving messages, 
or observing its users. This design sacrifices scalability: 
to meet these security requirements, all key distribution 
is done out of band, similar to PGP key signing. 

Messages sent with Covert Email will be smaller and 
potentially more frequent than for the proxy, so Covert 
Email uses text vectors instead of image vectors. Us- 
ing text also improves deniability, because receivers are 
inside the censored domain, and publishing a lot of 
text (e.g., comments, tweets) is considered more deni- 
able than many photos. Blogs, Twitter, and comment 
posts can all be used to store message chunks. Because 
Covert Email is used between a closed group of users 
with a smaller volume of messages, the task database is 
smaller and updated less often without compromising de- 
niability. Additionally, users can supply the text vectors 
needed to encode content (i.e., write or generate them), 
eliminating the need for an outside vector source. This 
simplifies the design. 

Suppose a group of mutually trusted users wishes to 
communicate using Covert Email. Before doing so, it 
must establish a shared secret key, for deriving message 
identifiers for sending and receiving messages. This one- 
time exchange is done out-of-band; any exchange mech- 
anism works as long as the censor is unaware that a key 
exchange takes place. Along with exchanging keys, the 
group establishes a task database. At present, a database 
is distributed with the application; the group can aug- 
ment its task database and notify members of changes 
using Covert Email itself. 

Once the group has established a shared key and a task 
database, its members can communicate. To send email 
to Bob, Alice generates a message identifier by encrypt- 
ing a tuple of his email address and the current date, us- 
ing the shared secret key. The date serves as a salt and 
ensures variation in message locations over time. Alice 
then sends her message to Bob using that identifier. 
Here, Bob’s email address is used only to uniquely iden- 
tify him within the group; in particular, the domain por- 
tion of the address serves no purpose for communication 
within the group. 


To receive new mail, Bob attempts to receive mes- 
sages with identifiers that are the encryption of his email 
address and some date. To check for new messages, he 
checks each date since the last time he checked mail. 
For example, if Bob last checked his mail yesterday, he 
checks two dates: yesterday and today. 
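
The send and receive steps can be sketched as follows. The text derives identifiers by encrypting the (address, date) tuple with the shared key; this sketch substitutes HMAC-SHA256, a deterministic keyed function available in the standard library, so it is a stand-in and not the scheme Covert Email uses. The function names and the example key and address are hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta


def message_identifier(shared_key: bytes, address: str, day: date) -> str:
    """Derive the identifier for mail sent to `address` on `day`.

    Assumption: HMAC-SHA256 replaces the encryption step described in the text.
    """
    payload = f"{address}|{day.isoformat()}".encode("utf-8")
    return hmac.new(shared_key, payload, hashlib.sha256).hexdigest()


def identifiers_to_check(shared_key: bytes, address: str,
                         last_checked: date, today: date) -> list:
    """Identifiers for every date from the last check through today."""
    days = (today - last_checked).days
    return [message_identifier(shared_key, address,
                               last_checked + timedelta(days=i))
            for i in range(days + 1)]


key = b"example-shared-secret"
yesterday, today = date(2010, 8, 10), date(2010, 8, 11)
to_check = identifiers_to_check(key, "bob@example.org", yesterday, today)
```

With a last check of yesterday, `to_check` holds two identifiers, one per date, matching the example in the text.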


If one group member is outside the censored domain, 
then Covert Email can interface with traditional email. 
This user runs an email server and acts as a proxy for 
the other members of the group. To send mail, group 
members send a message to the proxy, requesting that 
it be forwarded to a traditional email address. Like- 
wise, when the proxy receives a traditional email mes- 
sage, it forwards it to the requisite Covert Email user. 
This imposes one obvious requirement on group mem- 
bers sending mail using the proxy: they must use email 
addresses where the domain portion matches the domain 
of the proxy email server. Because the domain serves no 
other purpose in Covert Email addresses, implementing 
this requirement is easy. 


Implementation and Evaluation. We have implemented 
a prototype application for sending and retrieving 
Covert Email. Currently, the task database is a set 
of tasks that search posts of other Twitter users. We 
have also written tasks that search for popular keywords 
(e.g., “World Cup”). To demonstrate the general 
approach, we have implemented an (insecure) 
proof-of-concept steganography algorithm that stores data 
by altering the capitalization of words. Sending a short 
194-byte message required three tweets and took five 
seconds. We have shown that Covert Email has the 
potential to work in practice, although this application 
obviously needs many enhancements before general use, 
most notably a secure text vector encoding algorithm and 
a more deniable task database. 
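
To show what encoding data in capitalization can look like, here is a toy sketch. It is not the prototype's algorithm, whose details the text does not give: it assumes one bit per word, stored in the case of the word's first letter, and like the prototype it is insecure.

```python
def encode(cover_text: str, data: bytes) -> str:
    """Hide `data` in the capitalization of the first letter of each word."""
    bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
    words = cover_text.split()
    if len(bits) > len(words):
        raise ValueError("cover text has too few words")
    for i, bit in enumerate(bits):
        word = words[i]
        if not (word[0].isascii() and word[0].isalpha()):
            raise ValueError("each carrier word must start with a letter")
        # Upper-case first letter encodes 1, lower-case encodes 0.
        words[i] = (word[0].upper() if bit else word[0].lower()) + word[1:]
    return " ".join(words)


def decode(stego_text: str, n_bytes: int) -> bytes:
    """Recover `n_bytes` of data from the first 8 * n_bytes words."""
    words = stego_text.split()
    if len(words) < 8 * n_bytes:
        raise ValueError("text has too few words")
    bits = [1 if word[0].isupper() else 0 for word in words[:8 * n_bytes]]
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[8 * b:8 * b + 8]))
        for b in range(n_bytes)
    )


cover = ("the quick brown fox jumps over the lazy dog again "
         "and again until sunset today ok")
stego = encode(cover, b"Hi")
recovered = decode(stego, 2)  # round-trips back to b"Hi"
```

The unusual capitalization pattern is easy to spot, which illustrates why a secure text encoding is listed as a needed enhancement.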


7 Threats to Collage 


This section discusses limitations of Collage in terms of 
the security threats it is likely to face from censors; we 
also discuss possible defenses. Recall from Section 3.2 
that we are concerned with two security metrics: avail- 
ability and deniability. Given the unknown power of the 
censor and lack of formal information hiding primitives 
in this context, both goals are necessarily best effort. 
7.1 Availability 


A censor may try to prevent clients from sending and re- 
ceiving messages. Our strongest argument for Collage’s 
availability depends on a censor’s unwillingness to block 
large quantities of legitimate content. This section dis- 
cusses additional factors that contribute to Collage’s cur- 
rent and future availability. 

The censor could block message vectors, but a cen- 
sor that wishes to allow access to legitimate content may 
have trouble doing so since censored messages are en- 
coded inside otherwise legitimate content, and message 
vectors are, by design, difficult to remove without de- 
stroying the cover content. Furthermore, some encod- 
ing schemes (e.g., steganography) are resilient against 
more determined censors, because they hide the presence 
of Collage data; blocking encoded vectors then also re- 
quires blocking many legitimate vectors. 

The censor might instead block traffic patterns resem- 
bling Collage’s tasks. From the censor’s perspective, do- 
ing so may allow legitimate users access to content as 
long as they do not use one of the many tasks in the task 
database to retrieve the content. Because tasks in the 
database are “popular” among innocuous users by de- 
sign, blocking a task may also disrupt the activities of 
legitimate users. Furthermore, if applications prevent the 
censor from knowing the task database, mounting this 
attack becomes quite difficult. 

The censor could block access to content hosts, 
thereby blocking access to vectors published on those 
hosts. Censors have mounted this attack in practice; for 
example, China is currently blocking Flickr and Twitter, 
at least in part [43]. Although Collage cannot prevent 
these sites from being blocked, applications can reduce 
the impact of this action by publishing vectors across 
many user-generated content sites, so even if the cen- 
sor blocks a few popular sites there will still be plenty of 
sites that host message vectors. One of the strengths of 
Collage’s design is that it does not depend on any spe- 
cific user-generated content service: any site that can 
host content for users can act as a Collage drop site. 

The censor could also try to prevent senders from pub- 
lishing content. This action is irrelevant for applications 
that perform all publication outside a censored domain. 
For others, it is impractical for the same reasons that 
blocking receivers is impractical. Many content hosts 
(e.g., Flickr, Twitter) have third-party publication tools 
that act as proxies to the publication mechanism [51]. 
Blocking all such tools is difficult, as evidenced by Iran’s 
failed attempts to block Twitter [14]. 

Instead of blocking access to publication or retrieval 
of user-generated content, the censor could coerce con- 
tent hosts to remove vectors or disrupt the content inside 
them. For certain vector encodings (e.g., steganography) 
the content host may be unable to identify vectors 
containing Collage content; in other cases (e.g., digital 
watermarking), removing encoded content is difficult 
without destroying the outward appearance of the vector (e.g., 
removing the watermark could induce artifacts in a 
photograph). 


7.2  Deniability 


As mentioned in Section 3.1, the censor may try to 
compromise the deniability of Collage users. Intuitively, 
a Collage user’s actions are deniable if the censor can- 
not distinguish the use of Collage from “normal” Internet 
activity. Deniability is difficult to quantify; others have 
developed metrics for anonymity [39], and we are work- 
ing on quantitative metrics for deniability in our ongoing 
work. Instead, we explore deniability somewhat more in- 
formally and aim to understand how a censor can attack 
a Collage user’s deniability and how future extensions to 
Collage might mitigate these threats. The censor may at- 
tempt to compromise the deniability of either the sender 
or the receiver of a message. We explore various ways 
the censor might mount these attacks, and possible ex- 
tensions to Collage to defend against them. 

The censor may attempt to identify senders. Appli- 
cations can use several techniques to improve deniabil- 
ity. First, they can choose deniable content hosts; if a 
user has never visited a particular content host, it would 
be unwise to upload lots of content there. Second, vec- 
tors must match tasks; if a task requires vectors with cer- 
tain properties (e.g., tagged with “vacation”), vectors not 
meeting those requirements are not deniable. The vec- 
tor provider for each application is responsible for ensur- 
ing this. Finally, publication frequency must be indistin- 
guishable from a user’s normal behavior and the publi- 
cation frequency of innocuous users. 

The censor may also attempt to identify receivers, by 
observing their task sequences. Several application pa- 
rameters affect receiver deniability. As the size of the 
task database grows, clients have more variety (and thus 
deniability), but must crawl through more data to find 
message chunks. Increasing the number of tasks mapped 
to each identifier gives senders more choice of publica- 
tion locations, but forces receivers to sift through more 
content when retrieving messages. Increasing variety of 
tasks increases deniability, but requires a human author 
to specify each type of task. The receiver must decide 
an ordering of tasks to visit; ideally, receivers only visit 
tasks that are popular among innocuous users. 

Ultimately, the censor may develop more sophisticated 
techniques to defeat user deniability. For example, a cen- 
sor may try to target individual users by mounting timing 
attacks (as have been mounted against other systems like 
Tor [4, 33]), or may look at how browsing patterns change 
across groups of users or content sites. In these cases, 
we believe it is possible to extend Collage so that its re- 
quest patterns more closely resemble those of innocuous 
users. To do so, Collage could monitor a user’s normal 
Web traffic and allow Collage traffic to only perturb ob- 
servable distributions (e.g., inter-request timings, traffic 
per day, etc.) by small amounts. Doing so could obviously 
have a massive impact on Collage’s performance. 
Preliminary analysis shows that over time this technique 
could yield sufficient bandwidth for productive commu- 
nication, but we leave its implementation to future work. 


8 Conclusion 


Internet users in many countries need safe, robust mech- 
anisms to publish content and the ability to send or pub- 
lish messages in the face of censorship. Existing mecha- 
nisms for bypassing censorship firewalls typically rely on 
establishing and maintaining infrastructure outside the 
censored regime, typically in the form of proxies; un- 
fortunately, when a censor blocks these proxies, the sys- 
tems are no longer usable. This paper presented Collage, 
which bypasses censorship firewalls by piggybacking 
messages on the vast amount and types of user-generated 
content on the Internet today. Collage focuses on provid- 
ing both availability and some level of deniability to its 
users, in addition to more conventional security proper- 
ties. 

Collage is a further step in the ongoing arms race to 
circumvent censorship. As we discussed, it is likely that, 
upon seeing Collage, censors will take the next steps to- 
wards disrupting communications channels through the 
firewall—perhaps by mangling content, analyzing joint 
distributions of access patterns, or analyzing request 
timing distributions. However, as Bellovin points out: 
‘“There’s no doubt that China—or any government so- 
minded—can censor virtually everything; it’s just that 
the cost—cutting most communications lines, and de- 
ploying enough agents to vet the rest—is prohibitive. 
The more interesting question is whether or not ‘enough’ 
censorship is affordable.” [7] Although Collage itself 
may ultimately be disrupted or blocked, it represents an- 
other step in making censorship more costly to the cen- 
sors; we believe that its underpinnings—the use of user- 
generated content to pass messages through censorship 
firewalls—will survive, even as censorship techniques 
grow increasingly sophisticated. 


Abstract 


Many techniques have been proposed to generate keys, 
including text passwords, graphical passwords, and 
biometric data. Most of these techniques are not resistant 
to coercion attacks, in which an attacker forces the user 
to generate the key in order to gain access to 
the system or to decrypt an encrypted file. We present 
a novel approach to generating cryptographic keys that 
resists coercion attacks. Our technique incorporates 
the user’s emotional status, which changes 
when the user is under coercion, into the key generation 
through measurements of the user’s skin conductance. 
We present a model that generates cryptographic keys 
from one’s voice and skin conductance. A preliminary 
user study with 39 subjects 
shows that our approach has moderate false-positive 
and false-negative rates. We also present the attacker’s 
strategy in guessing the cryptographic keys, and 
show that the resulting change in the password space 
under such attacks is small. 


1 Introduction 


Many techniques have been proposed to generate strong 
cryptographic keys for secure communication and 
authentication. Some of these techniques, e.g., those using 
biometrics [15, 24, 27, 28, 35], offer desirable security 
properties including ease of use, unforgettability, 
unforgeability (to some extent), and high entropy. However, 
most of these schemes are not resistant to coercion 
attacks in which the user is forcefully asked by an at- 
tacker to reveal the key [32]. When the user’s life is 
threatened by an attacker, one would have to surrender 
the key, and the system will be compromised despite all 
the security properties described above. In this paper, we 
present a novel approach to protection against coercion 
attacks in generating keys. 

For a cryptographic key generation technique to be 
coercion-attack resistant, it is required that when the user 
is under coercion, he/she will have no way of generat- 
ing the key, or the key generated will never be the same 
as the one generated when he/she is not being coerced. 
If this requirement is met, then an adversary would not 
apply any threat to him/her because the adversary under- 
stands that the user would not be able to generate the key 
when he is threatened to do so. Here we assume that the 
coercion resistance property is publicly known to every- 
one, including the attackers; otherwise it might lead to 
a dangerous situation for the user, a problem we do not 
address in this paper. 

To show how desirable it is to have a coercion-resistant 
cryptographic key generation technique, here we list a 
few scenarios in which such a technique could be useful: 


• Bank vaults and safes: According to statistics released 
by the FBI [17], there were 1,094 reported 
robberies of commercial banks (58 of which were vault/safe 
robberies) between July 1, 2009 and September 30, 2009, 
totaling more than $9.4 million. If coercion-resistant 
techniques were used to protect vaults, managers could 
not be forced to open them. 


• Cockpit doors on airliners: The hijackers of 
September 11, 2001 used fueled aircraft as missiles 
to destroy ground targets. If the cockpit doors 
on airliners were equipped with coercion-resistant 
techniques, hijackers could not force a flight 
attendant to open the door. 


• Secret/capability holders in a war: Secret and capability 
holders could not be forced to reveal the 
secret or use the capability. 


In this paper, we explore the incorporation of the user’s 
emotional status (measured through skin conductance) 
into the process of key generation to achieve coercion 
resistance. We demonstrate this possibility by incorporating 
skin conductance into a previously proposed 
key generation technique using biometrics [24] (see Figure 1). 

Figure 1: Coercion attacks in key generation 


Incorporating skin conductance information into key 
generation is nontrivial. First, the fact that a change in a 
user’s emotional status leads to changes in a user’s skin 
conductance does not necessarily mean that our proposed 
technique is coercion resistant. If known patterns exist 
in such changes, an attacker might be able to guess the 
skin conductance of the user when he is not nervous by, 
e.g., flipping a few bits of the feature key (see Section 4) 
generated from the skin conductance of the user when he 
is nervous. We analyze this attack and its consequences, 
and show that the reduction in password space is small. 

Second, we hope that the key generation algorithm 
will take in the least amount of user specific informa- 
tion except the live data collected when it is used. This is 
because the key generation algorithm might be executed 
from the client’s machine, and the inputs to the algorithm 
could potentially be retrieved by the attacker during a 
coercion attack. However, when dealing with biometric 
data, removing such user-specific information from 
the inputs of the algorithm is not feasible, as different 
people have different sets of consistent and inconsistent 
biometric features. The algorithm would have too high a 
false-negative rate without this additional user-specific 
information. We propose using only user-specific feature 
lookup tables which contain valid key shares or garbage. 
We also analyze conceivable attacks that result from our 
proposal. 

Third, it is nontrivial to design a user study that 
evaluates our technique. We need to collect biometric 
data corresponding to different emotional states 
of real human beings. Efforts in this area are more de- 
manding than traditional efforts to get pattern recogni- 
tion data [31]. To analyze the effectiveness of our pro- 
posal, we perform a user study to see how one’s skin 
conductance changes when he/she is being coerced. This 
is used to evaluate the false-positive and false-negative 
rates of our model, and to analyze the attacker’s strategy 
in guessing the cryptographic key. With 39 participants 
in our user study, we find that our technique enjoys mod- 
erate false-positive and false-negative rates in key gen- 
eration. Furthermore, we find that the reduction in the 
password space for an informed attacker is small. 

The rest of the paper is organized as follows. In Section 2, 
we discuss some state-of-the-art approaches to 
cryptographic key generation and the recognition of 
emotional response. Background knowledge about the chosen 
biometrics and fingerprints is discussed in Section 3. 
In Section 4, we present the details of our approach to 
key generation using skin conductance and voice. The 
user study and results are presented in Sections 5 and 
6, respectively. We conclude in Section 7 with 
some possible future work. 


2 Related Work 


In this section we review some of the techniques and 
methodologies used to generate cryptographic keys from 
biometrics and some previous work on emotion 
recognition schemes using physiological signals. 

Many techniques for generating keys from biometrics, e.g., 
voice, iris, face, fingerprints, and keystroke dynamics, 
have been proposed in the last decade [15, 24, 27, 
28, 35]. The pioneering work in cryptographic key generation 
from behavioral biometrics uses the keystroke dynamics 
of a user while typing the password [25]. The features 
of interest are the duration of keystrokes and the latency 
between each pair of keystrokes. The generated cryptographic 
key is called the hardened password. However, 
the password generated is not very long and is susceptible 
to brute-force attacks [25]. Another method, based on 
secret sharing, was proposed to generate a biometric key 
from voice [24]. The distinguishing biometric features 
are selected based on the separation between the authentic 
and the imposter data, and then binarized by some 
thresholds. However, this method is not resistant to coercion 
attacks (which our proposed model targets), 
as the attacker can force the user to speak the password 
in a normal way. We discuss the approach to key generation 
from voice in more detail in the formal framework 
of our model (see Section 3). 

Another work on key generation from voice uses 
phonemes instead of words, as it is possible to gener- 
ate larger keys with shorter sequences [15]. Using the 
information of the voice model and the phoneme information 
of the segments, a set of features is created to 
train an SVM (Support Vector Machine) that can generate 
a cryptographic key. False-positive rates and the entropy of 
the system were not reported, so the security of the 
scheme remains unclear. 

There are many risks and security concerns regarding biometric 
systems [32, 33, 40]. Threat models 
include fake biometrics at the sensor, tampering with 
the stored templates, and coercion attacks. Biometric liveness 
detection has been proposed to thwart fake-biometrics attacks, 
e.g., by using perspiration in the skin [1] or blood 
flow [22]. However, no previous work has been proposed 
to resist coercion attacks in generating cryptographic 
keys using biometrics. There have been suggestions such as 
a panic alarm or duress code to fight against coercion attacks, 
but they differ from what we propose 
here: in those schemes users choose not 
to generate the key but to send a signal to authorities 
without catching the adversary’s attention, whereas in 
our scheme users are simply not able 
to generate the key. Our scheme therefore offers a 
stronger security property. 

Previous work also shows that emotion recognition using 
physiological signals, affect from speech, and facial 
expressions achieves success rates between 60% and 
98% [31]. Although many techniques have been proposed 
for emotion recognition [31, 20, 29, 21], none has 
looked into incorporating emotional status into key 
generation as we propose in this paper. 


3 Background 


In this section, we present some background knowledge 
on voice and skin conductance, and discuss why adding 
fingerprints to our model in the future would strengthen it 
as an authentication measure for protection against 
coercion attacks. We also discuss the reasons for selecting 
these features and their advantages over others 
in terms of acceptability, feasibility and usability. 


3.1 Why Skin Conductance? 


An emotion is a mental and physiological state associ- 
ated with a wide variety of feelings, thoughts, and be- 
havior. Emotions are subjective experiences, often as- 
sociated with mood, temperament, personality, and dis- 
position [11]. This emotional behavioral change is the 
key component in our model in fighting against coer- 
cion attack. Several physiological peripheral activities 
have been found to be related to emotional processing of 
situations. Many physiological parameters have been studied 
for emotion recognition, e.g., heart rate [3] (HR), 
skin conductance [23] (SC), EMG (Electromyography) 
signals, ECG (Electrocardiography) signals, body temperature, 
and BVP (Blood Volume Pulse) signals, 
among which HR and SC are especially attractive due to 
their strong association with the behavioral activation system 
(BAS) and the behavioral inhibition system (BIS), 
respectively [14]. 

SC is the change in the electrical properties of an individual 
person’s skin caused by an interaction between 
environmental events and the individual’s psychological 
state. Human skin is a good conductor of electricity, and 
when it is subjected to a weak electrical current, a change in the 
skin conductance level occurs [42]. We chose SC over 
HR for the following reasons. 


1. The skin conductance is one of the fastest respond- 
ing measures of stress response [16]. It is one of the 
most robust and non-invasive physiological mea- 
sures of autonomic nervous system activity [7]. Re- 
searchers have linked skin conductance response to 
stress and autonomic nervous system arousal [37]. 


2. A change in HR can be caused not only by stress but 
also by many other factors, including jogging or 
heavy physical work. SC, on the other hand, has 
been shown to be a promising measure in experimental 
studies [36] for its reliability. 


3. According to [41], HR is also impacted when stress 
levels rise but the shifts take a bit of time to happen 
and by the time the changes are noticeable the trig- 
gering stimulus is long past, whereas SC responses 
are rapid and easy to measure. 


4. HR is not suitable for our model due to 
feasibility issues. HR can be measured using 
an Electrocardiogram (ECG) machine or a stethoscope. 
Using an ECG machine is impractical because 
it is cumbersome: it requires several (at least 
three) electrodes and incurs installation costs [6]. 
A stethoscope is not suitable either because different 
placements of the stethoscope could lead to a high 
failure-to-capture (FTC) rate [30]. 


5. Using SC has an extra advantage as it can be mea- 
sured simultaneously while fingerprints are being 
scanned. This ensures that SC is measured from 
the authentic person (more on this in the coming 
subsection). The wide acceptance of finger scanning 
[18, 39] also suggests that SC measurement 
has the potential to gain user acceptance. 


As with any other biometric, skin conductance has some 
limitations. Some skin lotions can be used 
to manipulate the skin conductance level. In a test reported 
in [34], the use of specific solutions produced a significant 
increase in skin water content, which was indicated 
by an increase in skin conductance level. According to the 
product information for a cream by EncoSkin, applying the cream 
significantly increases the skin moisture level, which 
can be monitored through skin conductance [12]. 


3.2 Why Voice? 


Voice has been used previously to generate cryptographic 
keys [15, 24]. Voice as a biometric is desirable for gener- 
ating keys for two important reasons. First, it is the most 
familiar way of communication, which makes it ideal for 
many applications. Second, voice is a dynamic biometric 
and is not static like iris or fingerprint. A user can have 
different keys for different accounts by just changing the 
password (what to pronounce) or the vocalization of the 
same password (how to pronounce) to generate differ- 
ent cryptographic keys. In an event of key compromise a 
new cryptographic key can be easily generated. Note that 
voice has a potential disadvantage when used in fighting 
against coercion, namely that the attacker may blame the 
user for intentionally pronouncing the wrong password. 
We demonstrate our technique with voice; however, our 
scheme is not limited to voice, and other biometrics can 
be used as well. 


3.3 Why Fingerprint? 


A potential threat to our biometric system is the use of a 
spoken password from the genuine user (under stress) together with 
SC responses from another person (in a normal emotional 
state). To ensure that SC cannot be forged in this way, one can 
use a device that collects the fingerprint and skin conductance 
of the user at the same time, so that the fingerprint 
of the user can be checked and mapped to his/her 
skin conductance signal. However, we do not demonstrate 
how to use this as a measure in our proposed model, 
as it is not a contribution of this paper and is left for 
future work. 


4 Key Generation from Voice and Skin 
Conductance 


To show how skin conductance can be used to 
fight against coercion attacks in cryptographic key generation, 
in this section we present the details of a cryptographic 
key generation technique using voice and skin 
conductance. The criteria behind choosing skin conductance 
and voice are discussed in Section 3; other biometrics 
could be used in lieu of voice. Our way of using voice is 
similar (with some differences) to an earlier proposal for 
generating cryptographic keys using voice [24]. Table 1 
shows some notations used in the rest of this paper. 


4.1 An Overview 


Inputs to our model include the voice captured when the 
user utters the password into the microphone and the skin 
conductance measured. Figure 2 shows the input devices 
we used in our experimental setup. The output of our model 
is the generated cryptographic key. 

Figure 2: Input devices 

In the first phase (Figure 3 (a)-(h)), features extracted 
from the spoken password are used to generate 
a sequence of frames $f_V(1), \ldots, f_V(n)$ (3 (c)), from 
which an optimal segmentation into $s$ segments (component 
sounds) is derived (3 (f)). The segmentation obtained is then 
mapped to the feature descriptor using a random plane $\alpha_V$ 
(3 (g)). Furthermore, features are also extracted 
from the SC sample and the corresponding feature descriptors 
are computed (3 (h)). These feature descriptors 
should be “sufficiently similar” for the same user and 
“sufficiently different” for different users. By the end of 
the first phase, we have feature descriptors for both the voice 
and the SC signal. 

In the second phase (Figure 3 (i)-(l)), we perform 
lookup table generation and cryptographic key reconstruction. 
A total of $N_V$ samples from voice and $N_{SC}$ 
samples from SC are used to generate the lookup tables $T_V$ 
and $T_{SC}$. In cryptographic key reconstruction, feature 
keys are generated from the spoken password ($m_V$ bits) 
and SC ($m_{SC}$ bits). The two lookup tables 
and the feature keys are then used to generate the 
cryptographic key. 

In the next two subsections, we will present these two 
phases in more detail. 


4.2 Phase I: Feature descriptors derivation 
4.2.1 Feature descriptors from voice 


In the last six decades, speech recognition and speaker 
recognition have advanced considerably [8]. A speaker recognition 
system usually has three modules: feature extraction, 
pattern matching and decision making, among 
which feature extraction is especially important to our 
research as it estimates a set of features from the speech 
signal that represent the speaker-specific information. 
These features should be consistent for each speaker and 
should not change over time. The way we extract these 
features and derive the feature descriptors is very similar 
to the previous approach [24], except that we use 
Mel-frequency Cepstral Coefficients (MFCCs) instead of the 
linear cepstrum [24]. The advantage of MFCCs over the linear 
cepstrum is that the frequency bands are equally spaced on 
the mel scale, which approximates the human auditory 
system’s response more closely than the linearly-spaced 
frequency bands used in the linear cepstrum [13]. 

Table 1: Notations 


Associating centroids to the acoustic model. We convert 
the raw speech signal into a sequence of acoustic 
feature vectors in terms of the Mel-frequency Cepstral 
Coefficients (MFCCs) [10]. In the next paragraph we 
provide a short description of the extraction of MFCCs 
(see Figure 4). 

Figure 4: Block diagram of extracting MFCC 


The voice signal is first divided into blocks of 20 to 30 
msec (see Figure 3(a)), and a Discrete Fourier Transform 
(DFT) is performed to obtain the frequency representation 
of each block. The neighboring frequencies in each 
block are grouped into bins of overlapping triangular 
bands of equal bandwidth. These bins are equally spaced 
on a Mel scale instead of a linear scale, as the lower frequencies 
are perceptually more important than the higher 
frequencies. The content of each band is then summed 
and the logarithm of each sum is computed. To see this 
effect in the time domain, a Discrete Cosine Transform is applied 
to yield a “spectrum-like” representation $w(t)$; these values 
collectively make up an MFC, and $w(1), \ldots, w(12)$ are 
called the MFCCs, where higher-order coefficients are discarded. 
This vector is called a frame ($f_V$). 
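
The extraction steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the study: the mel conversion formula, the number of filters (20), the naive DFT, and the function names `hz_to_mel`, `mel_filterbank`, and `mfcc_frame` are all choices made here for illustration.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to mel using a common textbook formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(block):
    """Naive DFT of one block; returns the power in bins 0..len(block)//2."""
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Overlapping triangular filters whose edges are equally spaced in mel."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    edges = [hz / nyquist * (num_bins - 1) for hz in edges_hz]  # fractional bin index
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(num_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising edge of the triangle
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling edge of the triangle
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(block, sample_rate, num_filters=20, num_coeffs=12):
    """Return coefficients 1..num_coeffs of the DCT of the log band energies."""
    spectrum = power_spectrum(block)
    bank = mel_filterbank(num_filters, len(spectrum), sample_rate)
    # The small constant avoids log(0) for empty or silent bands.
    log_energy = [math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10) for row in bank]
    return [
        sum(e * math.cos(math.pi * t * (j + 0.5) / num_filters) for j, e in enumerate(log_energy))
        for t in range(1, num_coeffs + 1)
    ]
```

For a 20 msec block sampled at 8 kHz (160 samples), `mfcc_frame(block, 8000)` returns one 12-element frame. Keeping coefficients 1 to 12 and dropping coefficient 0 is an assumption of this sketch.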

We run a sliding window of 30 msec over an utterance 
to obtain blocks 10 msec apart from one another, and extract 
the MFCCs, $(w(1), \ldots, w(12))$, for each block (see 
Figure 3(b)). $n$ frames are obtained from the utterance of the 
password (see Figure 3(c)). An acoustic model of vectors 
from a speaker-independent and text-independent 
database of voice signals is obtained, and vector 
quantization is used to partition the acoustic model into 
clusters (see Figure 3(d)). A multivariate normal distribution 
is generated for each cluster, where each cluster 
is parameterized by the vector $c$ of component-wise 
means (called a centroid) and the covariance matrix $\Sigma$ 
for the vectors in the cluster. The density function for 
this distribution is 


$$P(x \mid c) = \frac{1}{(2\pi)^{\delta/2} \sqrt{\det(\Sigma)}} \, e^{-(x - c)^{T} \Sigma^{-1} (x - c)/2}$$


where $\delta$ is the dimension of the vectors. We denote the 
set of centroids as $C$. 
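
As a small illustration of how a frame can be scored against the clusters, the sketch below evaluates the logarithm of the normal density and picks the best-matching cluster. It assumes a diagonal covariance matrix to avoid a matrix inversion, which is a simplification made here and not something the text specifies; the names `log_density_diag` and `best_centroid` are hypothetical.

```python
import math


def log_density_diag(x, centroid, variances):
    """Log of the multivariate normal density, assuming a diagonal covariance."""
    delta = len(x)  # dimension of the vectors
    log_det = sum(math.log(v) for v in variances)
    mahalanobis = sum((xi - ci) ** 2 / v for xi, ci, v in zip(x, centroid, variances))
    return -0.5 * (delta * math.log(2.0 * math.pi) + log_det + mahalanobis)


def best_centroid(x, clusters):
    """Index of the (centroid, variances) pair under which x is most likely."""
    return max(range(len(clusters)), key=lambda k: log_density_diag(x, *clusters[k]))
```

Working with log densities instead of raw densities keeps products over many frames from underflowing.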


Segmentation of frames. After getting the centroids 
from a speaker-independent database of voice signals, 
we try to obtain the transcription, i.e., the starts and ends, 
of the phonemes of an individual user’s utterance. 

To do this, we perform segmentation on the spoken 
password. Let $f_V(1), \ldots, f_V(n)$ be the sequence 
of frames from the utterance, and $F(R_1), \ldots, F(R_s)$ be 
the sequence of $s$ segments ($s$ is a constant and the same 
for all users), where $F(R_i)$ is the $i$-th segment containing 
the sequence of frames $f_V(j), \ldots, f_V(j')$ such that 
$1 \le j \le j' \le n$. Intuitively, each $F(R_i)$ corresponds to 
one “component sound” of the user’s utterance. 

We do this with an iterative approach (see Algorithm 1). 
Ranges $R_1, \ldots, R_s$ are first initialized to be 
equally long. We then calculate the matching centroid $c$ 
for each segment $F(R)$, i.e., the one for which the likelihood 
of $F(R)$ w.r.t. $c$ is maximum. Dynamic programming is 
then used to determine a new segmentation for that frame 
sequence. This process is repeated until an optimal segmentation 
is obtained, which is mapped to the feature 
descriptor (see Figure 3(e,f)). 
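
The iterative procedure described above can be sketched as follows. This is a hedged illustration, not the authors' code: it works with log-likelihoods (sums instead of products), assumes a unit-variance Gaussian around each centroid, uses 0-based half-open ranges, and the names `log_lik`, `equal_ranges`, `resegment`, and `segment` are hypothetical.

```python
def log_lik(frame, centroid):
    """Log-likelihood of a frame, up to a constant, under a unit-variance Gaussian."""
    return -0.5 * sum((f - c) ** 2 for f, c in zip(frame, centroid))


def equal_ranges(n, s):
    """Split frame indices 0..n-1 into s contiguous, nearly equal [start, end) ranges."""
    bounds = [i * n // s for i in range(s + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(s)]


def resegment(frames, assigned, s):
    """Dynamic programme: best contiguous split given one centroid per segment."""
    n = len(frames)
    # prefix[i][j] is the summed log-likelihood of frames[0:j] under centroid i.
    prefix = []
    for c in assigned:
        row = [0.0]
        for f in frames:
            row.append(row[-1] + log_lik(f, c))
        prefix.append(row)
    neg_inf = float("-inf")
    best = [[neg_inf] * (n + 1) for _ in range(s + 1)]
    back = [[0] * (n + 1) for _ in range(s + 1)]
    best[0][0] = 0.0
    for i in range(1, s + 1):
        # Leave at least one frame for each of the remaining s - i segments.
        for end in range(i, n - (s - i) + 1):
            for start in range(i - 1, end):
                if best[i - 1][start] == neg_inf:
                    continue
                score = best[i - 1][start] + prefix[i - 1][end] - prefix[i - 1][start]
                if score > best[i][end]:
                    best[i][end] = score
                    back[i][end] = start
    ranges, end = [], n
    for i in range(s, 0, -1):
        start = back[i][end]
        ranges.append((start, end))
        end = start
    return ranges[::-1], best[s][n]


def segment(frames, centroids, s, delta=1e-6, max_iter=50):
    """Alternate centroid matching and re-segmentation until the score stops improving."""
    ranges = equal_ranges(len(frames), s)
    score = float("-inf")
    assigned = []
    for _ in range(max_iter):
        assigned = [
            max(centroids, key=lambda c: sum(log_lik(f, c) for f in frames[a:b]))
            for a, b in ranges
        ]
        ranges, new_score = resegment(frames, assigned, s)
        if new_score - score < delta:
            break
        score = new_score
    return ranges, assigned
```

The sketch requires at least `s` frames. The `max_iter` cap is a safety net added here; the stopping rule itself is the score-improvement threshold.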


Algorithm 1 Spoken password segmentation 
Segmentation ($f_V(1), \ldots, f_V(n)$, $s$) 

 1: $Score' \leftarrow 0$ 
 2: for $i = 1$ to $s$ do 
 3:     $R_i \leftarrow$ the $i$-th of $s$ equally long ranges covering $[1, n]$ 
 4: end for 
 5: repeat 
 6:     $Score \leftarrow Score'$ 
 7:     for $i = 1$ to $s$ do 
 8:         while $\forall c \in C$ do 
 9:             $L(F(R_i) \mid c) \leftarrow \prod_{j \in R_i} L(f_V(j) \mid c)$ 
10:         end while 
11:         $c(R_i) \leftarrow \arg\max_{c \in C} \{L(F(R_i) \mid c)\}$ 
12:     end for 
13:     let $\bigcup_{i=1}^{s} R'_i \leftarrow [1, n]$ 
14:     $Score' \leftarrow \prod_{i=1}^{s} L(F(R'_i) \mid c(R_i))$ 
15:     $R_i \leftarrow R'_i$ 
16: until $Score' - Score < \Delta$ 


Feature descriptor. Having derived a segmentation for 
a spoken password, we next define the feature descriptor 
($\phi_V$) of this segmentation, which is typically the same when 
the same user speaks the same utterance. To do this, 
we use a fixed vector $\alpha_V$, and define the $i$-th element of the 
feature descriptor as (see Figure 3(g)) 


$$\phi_V(i) = \alpha_V \cdot (\mu_V(R_i) - c(R_i)), \quad \forall\, 1 \le i \le s$$


That is, we normalize $\mu_V(R_i)$ with $c(R_i)$ and let $\phi_V(i)$ 
be the linear combination of components in it as specified 
by $\alpha_V$. This process results in a feature descriptor ($\phi_V$); 
$N_V$ feature descriptors are generated from 
$N_V$ voice samples and used to generate a lookup table 
$T_V$ (in Phase II). 
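
A minimal sketch of the voice feature descriptor follows. It assumes that the per-segment vector being normalized is the component-wise mean of the frames in that segment, which the surviving text does not state explicitly; the function names and the 0-based half-open ranges are also choices made here.

```python
def segment_mean(frames):
    """Component-wise mean of a non-empty list of frames (assumed per-segment vector)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def voice_descriptor(frames, ranges, assigned, alpha):
    """One value per segment: alpha . (mean of segment - matched centroid)."""
    phi = []
    for (start, end), centroid in zip(ranges, assigned):
        mu = segment_mean(frames[start:end])
        phi.append(sum(a * (m - c) for a, m, c in zip(alpha, mu, centroid)))
    return phi
```

Here `ranges` holds the segment boundaries, `assigned` holds the matched centroid of each segment, and `alpha` plays the role of the fixed vector that defines the linear combination.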


4.2.2 Feature descriptor from skin conductance 


When an external or internal stimulus makes 
a person stressed, the skin becomes a better conductor of 
electricity. This conductance can be measured between 
two points on the body (e.g., two fingers) and the level of 
electrical conductance is called skin conductance. Since 
we want to detect changes in the emotional status of a 
person, we record skin conductance over a time period. 

The SC signal was measured with our device and sampled 
at a frequency of 30 samples per second. Let 
$f_{SC}(1), \ldots, f_{SC}(\ell)$ denote the sampled values obtained 
from the SC signal. We model the feature values into a 
feature descriptor ($\phi_{SC}$) in a similar way as we did in the 
processing of voice. We choose a random vector 
$\alpha_{SC} = [\alpha_{SC}(1), \alpha_{SC}(2), \ldots, \alpha_{SC}(m_{SC})]$ ($m_{SC}$ is a constant), 
and use the Euclidean distance between all the points of 
the $\alpha_{SC}$ vector and $f_{SC}$ to compute the distance measure 
$M$ and from it the feature descriptor ($\phi_{SC}$): 


$$M(i, q) = \lVert \alpha_{SC}(i) - f_{SC}(q) \rVert, \quad \forall\, 1 \le i \le m_{SC},\ 1 \le q \le \ell$$

$\phi_{SC}$ is the mean of all the distance measures for each 
$\alpha_{SC}(i)$ value (see Figure 3(h)), i.e., 


$$\phi_{SC}(i) = \frac{1}{\ell} \sum_{q=1}^{\ell} M(i, q), \quad \forall\, 1 \le i \le m_{SC}$$


Note that the upper bound of $\alpha_{SC}(i)$ needs to be carefully 
chosen to maintain good entropy in the feature 
descriptors of different people. Also note that we do not 
store skin conductance information directly; rather, the 
feature descriptor generated from the distance measure is 
stored (as in the case of voice). $N_{SC}$ feature descriptors 
are derived from $N_{SC}$ SC samples and are then 
used to generate a lookup table $T_{SC}$ (in Phase II). 
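
The skin conductance descriptor can be illustrated with a short sketch. It treats each sample as a scalar, so the Euclidean distance reduces to an absolute difference; drawing the random vector uniformly from `[0, upper_bound]` and the names `sc_descriptor` and `random_alpha` are assumptions made here for illustration.

```python
import random


def sc_descriptor(samples, alpha):
    """For each alpha[i], the mean distance |alpha[i] - sample| over all samples."""
    return [sum(abs(a - f) for f in samples) / len(samples) for a in alpha]


def random_alpha(m_sc, upper_bound, seed=None):
    """Draw m_sc reference points uniformly from [0, upper_bound] (an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, upper_bound) for _ in range(m_sc)]
```

At 30 samples per second, a recording of a few seconds yields on the order of a hundred values in `samples`, and the descriptor has one entry per element of `alpha`.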


4.3 Phase II: Lookup table and cryptographic key generation 


We explained how we obtained the feature descriptors from 
voice and skin conductance in the previous subsection. 
Here, we will explain how we constructed lookup tables 
(training of the model) and obtained the cryptographic 
keys from the tables (usage of the model). The basic idea 
is that each entry of the lookup tables contains a share of 
the correct key or some garbage value, and the feature 
descriptor is used to determine the corresponding entry 
from the lookup table. In the end, the shares from the 
lookup tables are used to reconstruct the key. 


4.3.1 Lookup table generation 


Intuitively, if a feature descriptor is the same as the one 
recorded previously (i.e., in training), then the system 
should choose the correct key share from the lookup table, 
and garbage otherwise. In order to tolerate some 
small deviation in a user’s utterance and skin conductance, 
we calculate the mean ($\mu_{\phi_V}(i)$, $\mu_{\phi_{SC}}(i)$) and standard 
deviation ($\sigma_{\phi_V}(i)$, $\sigma_{\phi_{SC}}(i)$) of each feature descriptor 
over the $N_V$, $N_{SC}$ training samples, and define the partial 
feature descriptors $B_V$, $B_{SC}$ as 


$$B_V(i) = \begin{cases} 0, & \text{if } \mu_{\phi_V}(i) + k\sigma_{\phi_V}(i) < t_V \\ 1, & \text{if } \mu_{\phi_V}(i) - k\sigma_{\phi_V}(i) > t_V \\ \perp, & \text{otherwise} \end{cases} \quad \forall\, 1 \le i \le m_V$$


$$B_{SC}(i) = \begin{cases} 0, & \text{if } \mu_{\phi_{SC}}(i) + k\sigma_{\phi_{SC}}(i) < t_{SC} \\ 1, & \text{if } \mu_{\phi_{SC}}(i) - k\sigma_{\phi_{SC}}(i) > t_{SC} \\ \perp, & \text{otherwise} \end{cases} \quad \forall\, 1 \le i \le m_{SC}$$


for some thresholds $t_V$ and $t_{SC}$, respectively (see Figure 3(j)). 
This phase is the training phase in our model. 
Here $k$ is a parameter that controls the tradeoff between security 
and usability. As $k$ increases, the 
user has a better chance of generating the key successfully, 
but the security of the scheme is weakened. More precisely, 
increasing the value of $k$ increases the 
false-positive rate and decreases the false-negative rate (as 
shown in our results in the evaluation, Section 6). 

The idea of defining the partial feature descriptor 
in this way is illustrated in Figure 5 (where the set 
$\{B, \mu, \sigma, t\}$ is replaced by $\{B_V, \mu_{\phi_V}, \sigma_{\phi_V}, t_V\}$ and 
$\{B_{SC}, \mu_{\phi_{SC}}, \sigma_{\phi_{SC}}, t_{SC}\}$ for voice and skin conductance, 
respectively). If the $i$-th feature descriptor is consistently 
the same, i.e., $\mu(i) + k\sigma(i) < t$ (the first case in Figure 5), 
then there is a high probability that the value of the $i$-th 
feature descriptor will be less than $t$ during key reconstruction. 
Therefore, we can let the cell $T(i, 0)$ of the 
lookup table contain a valid share of the key (and let 
$T(i, 1)$ contain random bits). If the $i$-th feature descriptor 
is consistently different, i.e., the value of the feature 
descriptor is unreliable (when compared to the threshold 
$t$, as in the third case in Figure 5), we let both $T(i, 0)$ 
and $T(i, 1)$ contain valid shares (typically different). Unlike 
[24], lookup tables are not encrypted (for discussion 
on this, see Section 4.4). 


Figure 5: Definition of partial descriptor 
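
The training step described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the unreliable case is represented by `None`, the population standard deviation is used, shares are plain random byte strings, and the names `partial_descriptor` and `build_table` are hypothetical.

```python
import secrets
import statistics


def partial_descriptor(training, threshold, k):
    """training: one descriptor per sample. Returns 0, 1 or None for each feature."""
    partial = []
    for values in zip(*training):  # all training values of one feature
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        if mu + k * sigma < threshold:
            partial.append(0)       # reliably below the threshold
        elif mu - k * sigma > threshold:
            partial.append(1)       # reliably above the threshold
        else:
            partial.append(None)    # unreliable feature
    return partial


def build_table(partial, share_bytes=4):
    """Return (table, valid). table[i] holds two cells; valid[i] names the valid ones."""
    table, valid = [], []
    for b in partial:
        table.append([secrets.token_bytes(share_bytes), secrets.token_bytes(share_bytes)])
        # Unreliable features accept either cell; reliable ones accept only cell b.
        valid.append({0, 1} if b is None else {b})
    return table, valid
```

In this sketch valid shares and garbage are both random bytes, so the stored table alone does not reveal which cells are valid. The `valid` sets exist only at training time, where they determine which concatenations of shares count as valid keys.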


Having valid shares in both $T(i, 0)$ and $T(i, 1)$ 
leads to different key shares being used and consequently 
different keys being generated, which might not be 
desirable in systems that require a unique key. To 
solve this problem, a random cryptographic key $K$ 
(unique for each user) is first generated, which is then 
encrypted with all possible valid keys ($K_{V_i}$) that can 
be derived from $\langle T_V \| T_{SC} \rangle$. The key generation 
template therefore comprises the key $K$ encrypted 
with the $Z = |K_{V_i}|$ derived keys, and the lookup tables 
$\langle T_V \| T_{SC} \rangle$. Thus, 

$$\text{template} = \langle \langle T_V \| T_{SC} \rangle, \langle E_{K_{V_1}}(K \| B), E_{K_{V_2}}(K \| B), \ldots, E_{K_{V_Z}}(K \| B) \rangle \rangle$$

where $E_{K_{V_i}}(msg)$ is a publicly known encryption 
algorithm and $B$ is a unique string associated with each 
user, which helps us to determine whether the decryption 
is correct or not in Section 4.3.2. 





4.3.2 Cryptographic key reconstruction 


When a user tries to reconstruct the cryptographic key, 
he/she first presents his/her spoken password and skin 
conductance. The model collects this information, extracts 
the features and generates the feature descriptors 
for both the voice and the SC. Corresponding shares from the 
lookup tables are chosen based on the feature descriptors: 

$$b_V(i) = \begin{cases} 0, & \text{if } \phi_V(i) < t_V \\ 1, & \text{otherwise} \end{cases} \quad \forall\, 1 \le i \le m_V$$

$$b_{SC}(i) = \begin{cases} 0, & \text{if } \phi_{SC}(i) < t_{SC} \\ 1, & \text{otherwise} \end{cases} \quad \forall\, 1 \le i \le m_{SC}$$


For example, if the feature descriptor $\phi_{SC}(i)$ is less than 
the threshold $t_{SC}$, then $b_{SC}(i) = 0$ and $T_{SC}(i, 0)$ is chosen 
from $T_{SC}$ as a key share; otherwise $b_{SC}(i) = 1$ and 
$T_{SC}(i, 1)$ is chosen (see Figure 3(i)). $b_V$ and $b_{SC}$ are the 
feature keys and are obtained from voice and SC, respectively. 

A key $K'$ is derived by concatenating the key shares 
(see Figure 3(k)). This derived key is then used to decrypt 
the $|K_{V_i}|$ encrypted keys stored in the template. If 
the decryption succeeds (by matching the released $B$ and 
the stored $B$), then the key $K$ is released: 

$$D_{K'}(E_{K_{V_i}}(K \| B)) = \begin{cases} K \| B, & \text{if } K' = K_{V_i} \\ \text{Random}, & \text{if } K' \ne K_{V_i} \end{cases}$$

where $D_{K'}(msg)$ is a publicly known decryption algorithm. 
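
The template construction and key reconstruction described above can be put together in a small end-to-end sketch. It is illustrative only: the cipher is a toy XOR keystream built from SHA-256, used as a stand-in for a real encryption algorithm and NOT secure; failure is signalled by returning `None` rather than random output; and all function names are hypothetical.

```python
import hashlib
import itertools


def keystream_xor(key, data):
    """Toy stand-in cipher (NOT secure): XOR data with a SHA-256-derived keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def feature_key(descriptor, threshold):
    """Bit i is 0 if descriptor[i] is below the threshold, else 1."""
    return [0 if value < threshold else 1 for value in descriptor]


def derive_key(table, bits):
    """Concatenate the share selected in each row of the lookup table."""
    return b"".join(table[i][b] for i, b in enumerate(bits))


def make_template(table, valid, secret_key, tag):
    """Encrypt secret_key || tag under every key derivable from valid cells."""
    ciphertexts = []
    for bits in itertools.product(*[sorted(v) for v in valid]):
        ciphertexts.append(keystream_xor(derive_key(table, bits), secret_key + tag))
    return ciphertexts


def reconstruct(table, ciphertexts, descriptor, threshold, tag):
    """Return the secret key if any ciphertext decrypts to a value ending in tag."""
    derived = derive_key(table, feature_key(descriptor, threshold))
    for ct in ciphertexts:
        plain = keystream_xor(derived, ct)
        if plain.endswith(tag):
            return plain[: -len(tag)]
    return None
```

The number of ciphertexts in the template doubles with every unreliable feature, which is one practical reason to keep the number of unreliable features small.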


4.4 Discussions 


While we try to use the consistency of voice and skin 
conductance to generate the correct key only when it is 
the genuine user in the normal emotional state, the incon- 
sistency of voice and skin conductance poses challenges, 
too. The voice and skin conductance of 
the genuine user in a non-stressed emotional state might 
change due to tiredness, illness, noise, etc. 

We used an error-correction technique based on
Hamming distance to improve the usability of the
scheme. ${}^{m}C_{d}$ different keys, each at Hamming distance $d$
from $K'$, are derived from any freshly generated key $K'$
obtained from the feature descriptors and $T$ (similar to
the one derived in Section 4.3.2). All of these
${}^{m}C_{d}$ keys are then used to decrypt the encrypted keys
before giving any negative answer to the user. If the decryption
succeeds, then the key $K$ is released. For example, if
$d = 2$ and the length of the key is $m$, then ${}^{m}C_{2}$ different
keys are derived. Thus, $|K_H| \times {}^{m}C_{2}$ decryptions are
performed in attempting to recover $K$.
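To make the count concrete, the candidate keys at a fixed Hamming distance can be enumerated as below. This is a sketch that assumes the key is handled as a bit vector of length $m$; the function name and the example key are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Iterator, Tuple


def keys_at_distance(key: Tuple[int, ...], d: int) -> Iterator[Tuple[int, ...]]:
    """Yield every bit vector that differs from `key` in exactly d positions."""
    for positions in combinations(range(len(key)), d):
        flipped = list(key)
        for p in positions:
            flipped[p] ^= 1  # flip the selected bit
        yield tuple(flipped)


# Hypothetical 6-bit derived key and d = 2.
k_prime = (1, 0, 1, 1, 0, 0)
neighbours = list(keys_at_distance(k_prime, 2))
expected_count = comb(len(k_prime), 2)  # mC2 candidates
```

Each candidate would then be tried against every encrypted key in the template, which is where the product of the two counts in the text comes from.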

Another issue concerns the privacy of the biometric 
data used. Ballard et al. propose using randomized bio- 
metric templates protected with low-entropy passwords 
to provide strong biometric privacy [4]. One can use this 
in conjunction with our model to provide both coercion 
resistance and biometric privacy. However, it is unclear 
whether the use of low-entropy passwords may have a 
negative impact on coercion resistance since, intuitively, 
an attacker may blame the user for providing the wrong 
low-entropy password under coercion (a similar problem is
discussed in Section 3.2). We leave developing a solution
that satisfies both requirements as future work.


5 Experimental Setup 


We presented our design for generating a cryptographic
key using voice and skin conductance in Section 4. It is
important to test it with real human participants to evaluate
its performance. However, this is difficult, as we need
to find a way to make the participants feel stressed or
nervous. Clearly, we cannot actually coerce them,
e.g., by putting a gun to their heads.
Nevertheless, we performed case studies that induce stress
in the participants and measured their voice and skin
conductance. (IRB approval was obtained from our university
before the user study.) We present the experimental
setup in this section and the evaluation results and
discussion in the next section.


5.1 Demographics 


Since we were going to induce stress in the participants,
we decided to concentrate on the younger generation
(undergraduate and graduate students aged 18 to
30). We had 43 participants altogether, of whom 4
detached the sensors from their fingers when
they became nervous during the experiment. Therefore, we
successfully performed our experiments on 39 participants,
of whom 22 were male and 17 were female.


5.2 Experimental settings 


Participants were asked to sit in a small office where the
overhead fluorescent lights were turned off and a dim red
incandescent lamp was turned on to reduce possible
electrical interference with the monitoring equipment.
The room was air conditioned to approximately 72°F, and
the humidity level was generally dry. This was done to
account for the variation of skin conductance under different
environmental conditions [36].

Skin conductance sensors¹ were attached to the three
middle fingers of the participant to record SC (shown in
Figure 2). The participant was also asked to keep her left
hand (with sensors attached) as still as possible to avoid
interfering with the sensors. Fake heart rate tags were
tied to the wrist, which gave the illusion that the
heart rate was being monitored.

Initially, the purpose and the steps of the study were only
partially disclosed, to ensure
that the participant’s responses would not be affected by her
knowledge of the research.
5.3 Procedure


We ran two experiments (e1 and e2). Each experiment
consisted of two parts: the first parts (e1n and e2n)
were conducted when the participants were in a normal
(calm) condition, and the second parts (e1s and e2s) were
conducted when the participants were stressed.

We ran experiment e1n by

• showing pleasant (geographical) pictures one after another,
together with short phrases related to the pictures (with the
spoken password embedded), and asking
the participant to read the phrases out;

• showing fake visual heartbeats at a normal rate at
the bottom of the screen and playing the corresponding
heartbeat sound.

In order to capture the emotional responses in the
stress scenario in e1s,

• a frightening horror movie was played, replacing
the pleasant pictures;

• the rate of the heartbeats was gradually increased
to induce more stress in the participant;

• the participant was asked to read out some short
phrases at the end of each horror scene (rather than
along with the video) to avoid distraction.


Similar studies have been performed previously to 
measure the stress level in users [26, 19]. 

In e2, we went a bit further to induce more stress in
the participant. Figure 6 shows the change in skin
conductance in response to different events in e2. During e2,
the participant was asked to type a few sentences (e.g.,
“Work is much more fun than fun”) shown to her within a
fixed period of time. She was also warned (prior to the
experiment) not to press the “ALT” key on the keyboard,
as it would cause the computer program to crash and all
data would be lost (event A). We then left the participant
alone in the room to continue typing (event B). We
configured the computer to restart after 3 minutes
irrespective of whether the participant actually touched the
“ALT” key or not. The computer would then boot from
a USB drive into MS-DOS and display some error
messages (event C). This completes the first part of e2, i.e.,
e2n.

Stress started to develop at this point, as the
participant believed that she had pressed the “ALT” key
and caused data loss on the computer (event D). We
purposely left the participant alone so that stress could
develop further and she could not get immediate help to
resolve the “problem”. After that, the researcher entered
the room and examined the keyboard and the computer
(event E) and then accused the participant of negligently
pressing the “ALT” key (event F). This turned
out to be successful in making the participant stressed, as
we observed that many participants were nervous at this
point. Some kept saying “sorry”; some tried very
hard to fix the “problem”; and some started calling for
help. There were also voluntary confession statements
from the participants, e.g., “I hit the ALT key by mistake
in place of typing the ‘X’ key” and “It was a mistake from
my side.”

Figure 6: Change of skin conductance in e2


5.4 Discussion 


In this section, we discuss the differences between the emotional
state of a user in real life and in our user study, and the
limitations of our experiment.


1. Training of the system

• Real life: the user is in a (controlled) environment
specified by our system, in which the
stress level is low. This allows us to generate
the lookup table for that particular user with
the normal skin conductance level.

• User study: the user is in exactly the (controlled)
environment specified by our system,
i.e., watching a relaxation movie.

2. Trying to generate the cryptographic key; no coercion

• Real life: a user could be in various emotional
states, including being happy, sad, angry, etc.

• User study: same as in training, when the user
is watching a relaxation movie. In this work,
we only analyze how our system performs
when users are calm and relaxed. It
remains future work to analyze how it works
when the user is in other emotional states. We
do expect the false-negative rate to rise when
the user is in other emotional states.

3. Trying to generate the cryptographic key; under coercion

• Real life: a user can be forced/coerced in
many different ways, e.g., a gun to the head
or a knife at the throat.

• User study: watching a horror movie and being
pressured to admit guilt (for having damaged a
notebook computer). We tried our best to
approximate real-life scenarios, but there is a
limit to how far we could go with real human
participants (e.g., IRB restrictions). However,
we believe that what we did is a clever way of
studying human behavior under coercion.


The discussion above highlights some limitations of our
scheme, e.g., we have not tested how it reacts to other
emotional states (happy, sad, angry, etc.) or how skin
conductance may change naturally (due to oily fingers,
etc.). There are two other important limitations in the
present study. First, our study does not test the repeatability
of our scheme, i.e., we did not ask the participants
to come back and try again. The second limitation
comes from the over-controlled environment, e.g., a quiet
office (because of the use of voice) and controlled temperature
and humidity [9] (because of the use of skin conductance).
It remains future work to test our
scheme in different settings.


6 Evaluation and Discussion 


In this section, we analyze the data collected in our user
study. We first describe how we partition the data into
different groups (e.g., for training and testing purposes) in
Section 6.1. We then present a series of analyses of the
false-positive and false-negative rates (Section 6.2). Finally,
we show the change in the password space when
an attacker has perfect knowledge of our design and the
content stored, and show that this change in the
worst case is small (Section 6.3).


6.1 Training and Testing Datasets 


We have collected voice and skin conductance signals 
for 39 participants. For each participant, we have col- 
lected many samples of the signals when the participant 
is either calm or stressed. Table 2 shows the number of 
samples we collected in each experiment for each par- 
ticipant. Voice signals are typically 2 to 3 seconds long, 
while skin conductance signals are about 10 seconds long 
to avoid fluctuations. 
Table 2: Number of samples collected for each participant

Figure 7 shows how we obtain the datasets. We

• split the original sample sets $\{v^{full}_{e1n}, w^{full}_{e1n}, w^{full}_{e2n}\}$ into
two equal halves $\{v^{train}_{e1n}, w^{train}_{e1n}, w^{train}_{e2n}\}$ and $\{v^{test}_{e1n}, w^{test}_{e1n}, w^{test}_{e2n}\}$
to obtain datasets for training and testing (see the half circles);

• combine different voice samples and skin conductance
samples to create new datasets to test our system
(see circles in the middle column). $\{v^{train}_{e1n}$ & $w^{train}_{e1n}\}$,
$\{v^{train}_{e1n}$ & $w^{train}_{e2n}\}$, $\{v^{test}_{e1n}$ & $w^{test}_{e1n}\}$, and $\{v^{test}_{e1n}$ & $w^{test}_{e2n}\}$
are combined to create $\{\xi^{train}_{e1n}\}$, $\{\xi^{train}_{e2n}\}$,
$\{\xi^{test}_{e1n}\}$, and $\{\xi^{test}_{e2n}\}$, respectively;

• to obtain the stress datasets, $\{v^{full}_{e1s}$ & $w^{full}_{e1s}\}$ and $\{v^{full}_{e2s}$ & $w^{full}_{e2s}\}$
are combined to create $\{\xi^{full}_{e1s}\}$ and $\{\xi^{full}_{e2s}\}$, respectively.

Figure 7: Splitting and combining datasets


Note that the voice and skin conductance samples that 
are combined together might not have been captured at 
exactly the same time. We allow a time gap because an 
attacker might record the voice of the victim to be used 
in conjunction with the skin conductance of the victim at 
a slightly different time. Both samples were captured in
the same part of the experiment, though, i.e., both from
e1s or both from e2s.


6.2 Accuracy of our model 


The false-negative rate of our system is defined as the
percentage of login attempts by a legitimate user in which
her cryptographic key fails to be generated, averaged over all
users in a population $A$. Similarly, the false-positive
rate is defined as the percentage of attempts by illegitimate
users, or by legitimate users in a stressful situation, that
the system fails to detect, averaged over all users in a
population $A$.
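As a small illustration of these definitions, the per-user rates can be computed first and then averaged over the population. The data layout and the names below are assumptions made for the sketch, and the outcome lists are invented toy values rather than study results.

```python
from statistics import mean
from typing import Dict, List


def rate(outcomes: List[bool]) -> float:
    """Percentage of attempts in which the error event occurred."""
    return 100.0 * sum(outcomes) / len(outcomes)


def averaged_rate(per_user: Dict[str, List[bool]]) -> float:
    """Average the per-user percentages over all users in the population."""
    return mean(rate(outcomes) for outcomes in per_user.values())


# Toy outcomes: True marks an error for that attempt.
# False negative: a calm, genuine user fails to regenerate her key.
genuine_rejected = {"a1": [False, False, True, False], "a2": [False] * 4}
# False positive: an impostor or a stressed user obtains the key anyway.
attack_accepted = {"a1": [False, True], "a2": [False, False]}

fnr = averaged_rate(genuine_rejected)
fpr = averaged_rate(attack_accepted)
```

Averaging per user first means every user carries equal weight even when the number of attempts differs between users.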


Voice samples only We first evaluate the voice samples
we collected in our experiments. The purpose is to determine
the false-positive and false-negative rates in the event
that only voice samples are used to generate cryptographic
keys. The system is trained with $v^{train}_{e1n}$ of user $a_i$, and
is tested against $v^{full}_{e1n}$ of user $a_j$, where $i \neq j$, $\forall j \in A$,
to calculate the false-positive rates, and against $v^{test}_{e1n}$
of user $a_i$ to calculate the false-negative rates. Results
are averaged over all users in $A$. We try different random
$a_V$ vectors and choose the one that yields the smallest
sum of the false-positive and false-negative rates. We try
different settings of the Hamming distance parameter $d$,
and find that 2 gives a reasonable tradeoff between
false-positive and false-negative rates. The false-positive and
false-negative rates for different values of $k$ are plotted
in Figure 8.
Figure 8: False-positive and false-negative rates for spoken passwords


Figure 8 shows that we achieve accuracy comparable
to previous work [24] in terms of the
false-negative rate. The false-positive rate was not reported
in [24].


Skin conductance only Next, we evaluate the skin
conductance samples to see how well they reflect the
change in the participants’ emotional state. We show
the results in Figure 9(a) and Figure 9(b) for experiments
e1 and e2, respectively. The different colored lines denote
different $k$ values in Figure 9 and Figure 10. The system
is trained with $w^{train}_{e1n}$ (and $w^{train}_{e2n}$, respectively) of user
$a_i$, and is tested against the stressed full dataset, $w^{full}_{e1s}$
(and $w^{full}_{e2s}$, respectively), of the same user $a_i$ to calculate
the false-positive rates, or against the normal test
dataset, $w^{test}_{e1n}$ (and $w^{test}_{e2n}$, respectively), of the same user $a_i$ to
calculate the false-negative rates. Results are averaged
over all users in $A$.
Figure 9: False-positive and false-negative rates for skin conductance


Note that the false-positive and false-negative rates are
higher for e1 in Figure 9(a). We believe this is because
some of the horror videos were not very intense and did
not produce a noticeable change in skin conductance for
many users.

We can observe the tradeoff between various settings of $k$
and the threshold from these figures. In general, this
shows that when a user is under stress, her skin conductance
can be used to differentiate between the two
emotional states with good accuracy. For example, in e2,
when $k = 1.25$ and $t_{SC} = 2.1$, we obtained a false-positive
rate of 3.2% and a false-negative rate of 2.2%
(see Figure 9(b)). If we increase the value of $k$ from 1.25
to 1.875 in both Figures 9(a) and 9(b), we see
a decrease in the false-negative rates (increasing usability)
and an increase in the false-positive rates (compromising
security). We used the Hamming distance
parameter $d = 2$ in our setting.


Voice combined with skin conductance Voice and
skin conductance samples are combined as shown in Figure 7
to obtain the samples needed in this evaluation.
We first train the system with $\xi^{train}$, and then test
the system against three different datasets to evaluate the
false-positive and false-negative rates.


19th USENIX Security Symposium 


a. $\xi^{full}$ of user $a_j$, where $i \neq j$, $\forall j \in A$: when a different
person tries to generate the key (Figure 10(a));

b. $\xi^{full}$ of user $a_i$: when the same user tries to generate
the key when she is being coerced (Figure 10(b));

c. $\xi^{test}$ of user $a_i$: when the same user tries to generate
the key when she is not being coerced (Figure 10(c)).


We evaluate the false-positive rates in the first two 
cases and the false-negative rates in the third case. Re- 
sults are averaged over all users in A. We use a hamming 
distance parameter d = 4, and show the results in Fig- 
ure 10. 
Figure 10: False-positive and false-negative rates for
voice combined with skin conductance

These results show that generating cryptographic keys
from voice and skin conductance is effective in fighting
coercion attacks: we observe false-positive rates between
6% and 15% for $1 < t_{SC} < 4$, which can
rise up to 22% for $t_{SC} > 5$. False-negative rates are
between 0% and 4.5% for all values of $t_{SC}$. Further efforts
are needed to reduce the false-positive and false-negative
rates. As in the previous subsection, if we
increase the value of $k$ from 1.25 to 1.875, we see
a decrease in the false-negative rates and an increase in the
false-positive rates.


6.3 Change in password space 


In this subsection, we discuss more advanced attacks on
our system (if implemented) besides forcing the victim
to provide her spoken password and skin conductance. If
such a system is implemented, then we need to approximate
the entropy in the worst case of these advanced
attacks, in which the attacker makes use of group
information about skin conductance and the information
stored in the key generation module.

The group information about skin conductance refers
to the patterns observed in the change in the users’
feature key generated from skin conductance ($b_{SC}$)
when they are coerced. An attacker could use this
information to selectively modify the victim’s skin conductance
feature key in order to improve the probability of
generating the correct key. Section 4 describes how the
feature key ($b_{SC}$) is obtained for SC.

Although we do not store any biometric information
of the user directly on the device (see the discussion in
Section 4), we still need to store the lookup tables ($T_V$
and $T_{SC}$), which are derived from user-specific data
(e.g., feature descriptors). Although these tables can be
encrypted with a user password as discussed in previous
work [24], we try not to base the security of our
model on the secrecy of these tables, because we are dealing
with coercion attacks. In the rest of this subsection, we
assume that an attacker has perfect knowledge of both
the group information about skin conductance and the
lookup tables. We want to approximate the guessing
entropy, i.e., the reduction in the password space for this
more powerful attacker.

More precisely, we assume in the worst case that an
attacker has access to

• the lookup tables $T_V$ and $T_{SC}$;

• the recorded spoken password of the user and the
corresponding feature key $\{b_V(i)\}$;

• the recorded skin conductance when the user
is stressed and the corresponding feature key
$\{b^{s}_{SC}(i)\}$;

• the database $D$, which contains the mapping of the
SC feature keys when users are normal ($\{b^{n}_{SC}(i)\}$)
to the scenario when they are stressed ($\{b^{s}_{SC}(i)\}$)
for all users in a population $A$.


A sample database $D$ for such a mapping of SC is shown
in Table 3 for $|A|$ users. Each row in the table is a
record of the feature key of a user when she is normal
and when she is stressed, and the last column shows the indices of the
feature key that changed from $b^{n}_{SC}$ to $b^{s}_{SC}$.


Normal          Stressed        Changed indices
011011011011    001101110011    2, 4, 5, 7, 9
010010010111    010100110110    4, 5, 7, 12
...             ...             ...
010101001100    111111100110    1, 3, 5, 7, 9, 11

Table 3: A sample database D
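The last column of Table 3 follows directly from comparing the two feature keys position by position. A minimal sketch, assuming feature keys are written as bit strings and indices are counted from 1 (the function name is hypothetical):

```python
from typing import List


def changed_indices(normal: str, stressed: str) -> List[int]:
    """Return the 1-based positions where the two feature keys differ."""
    return [
        i
        for i, (n, s) in enumerate(zip(normal, stressed), start=1)
        if n != s
    ]


# The three sample rows of Table 3.
rows = [
    ("011011011011", "001101110011"),
    ("010010010111", "010100110110"),
    ("010101001100", "111111100110"),
]
changes = [changed_indices(n, s) for n, s in rows]
```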


The attacker’s strategy would be to analyze $D$ to learn
patterns in which people’s feature keys $\{b^{n}_{SC}\}$ change
to $\{b^{s}_{SC}\}$, e.g., whenever the $i$-th index of the feature key
changes, the $j$-th one changes too.

These patterns can be easily learned by applying a well 
studied technique called association rule mining [2]. The 
attacker can then use these patterns to reduce the pass- 
word space. Here, we use a simple example to demon- 
strate the idea. 

We first represent the password space by a sequence of
0’s (the corresponding index in $\{b^{s}_{SC}\}$ will definitely not
change when a user’s emotional state changes), 1’s (the
corresponding index in $\{b^{s}_{SC}\}$ will definitely change),
and *’s (don’t know). For example, $[1, *, *]$ represents a password
space in which only the first index of $\{b^{s}_{SC}\}$ is known to change,
and therefore the password space is $2^2 = 4$. When the
attacker makes use of a learned pattern, e.g., “the change of
the first index of $\{b^{s}_{SC}\}$ implies the change of the second
one”, he can convert the password space from $[1, *, *]$
to $[1, 1, *]$, since the second index of $\{b^{s}_{SC}\}$ will definitely
change, too. With this, the password space reduces
to $2^1 = 2$.
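The three-position example can be reproduced with a short sketch. The representation (a list of "0", "1", and "*" symbols) and the function names are assumptions made for illustration; the sketch only handles the simple case in which the rule's antecedent is already known to change.

```python
from typing import List, Sequence


def space_size(pattern: Sequence[str]) -> int:
    """Each unknown position ('*') doubles the number of candidate keys."""
    return 2 ** list(pattern).count("*")


def apply_rule(pattern: Sequence[str], antecedent: Sequence[int],
               consequent: Sequence[int]) -> List[str]:
    """If every antecedent index (1-based) is '1', force the consequent to '1'."""
    result = list(pattern)
    if all(result[i - 1] == "1" for i in antecedent):
        for j in consequent:
            result[j - 1] = "1"
    return result


before = ["1", "*", "*"]
after = apply_rule(before, antecedent=[1], consequent=[2])
```

Here `space_size(before)` is 4 and `space_size(after)` is 2, matching the reduction described above.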

We present the detailed algorithm for estimating this
reduction in the password space, together with an example,
in Appendix A.

We constructed the database $D$ with the skin conductance
samples collected in our user study, mined all
association rules, and then used the above algorithm to find
the change in the password space. Figure 11 shows
the results for different settings of the threshold and of the
minimum confidence in the association rule mining.
$k$ is set to 1.25 in this experiment, and the minimum
support is set to 30%. Note that the original password space
is 2sc — 2°°. Although in the worst case the effective 
number of bits to represent the password space reduces
by roughly 20%, many settings of the threshold value
result in only a 10% reduction.

Figure 11: Password space reduction
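For readers unfamiliar with support and confidence, the sketch below mines single-index rules of the form "index a changes implies index c changes" from sets of changed indices. It is a simplified stand-in for a full association rule miner, not the procedure used in the study; the function name is hypothetical, the rows are the three sample rows of Table 3, and the 90% confidence threshold is an arbitrary illustrative choice (only the 30% minimum support comes from the text).

```python
from itertools import permutations
from typing import List, Set, Tuple


def mine_pair_rules(rows: List[Set[int]], min_support: float,
                    min_confidence: float) -> List[Tuple[int, int, float, float]]:
    """Return (a, c, support, confidence) for every rule a => c that passes."""
    n = len(rows)
    items = sorted(set().union(*rows))
    rules = []
    for a, c in permutations(items, 2):
        both = sum(1 for r in rows if a in r and c in r)
        ante = sum(1 for r in rows if a in r)  # > 0 because a occurs in rows
        support = both / n
        confidence = both / ante
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, c, support, confidence))
    return rules


# Changed-index sets taken from the sample rows of Table 3.
D = [{2, 4, 5, 7, 9}, {4, 5, 7, 12}, {1, 3, 5, 7, 9, 11}]
rules = mine_pair_rules(D, min_support=0.3, min_confidence=0.9)
pairs = {(a, c) for a, c, _, _ in rules}
```

In this toy data, index 4 changing always comes with index 5 changing, so the rule 4 ⇒ 5 passes, while the reverse rule 5 ⇒ 4 has confidence 2/3 and is rejected.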

Another way to attack our system is to make the user
take a sedative to relieve his/her anxiety before capturing
SC. The attacker can then use this skin conductance to
generate the key. We are trying to collaborate with medical
practitioners and researchers to study the correlation
between the two skin conductances: one under normal
conditions without any sedative, and the other under
coercion after taking a sedative. For now, this
remains future work.


7 Conclusion and Future Work 


In this paper, we present a novel approach for fighting
coercion attacks in generating cryptographic keys
using the skin conductance (SC) of a person. In a coercion
attack, the attacker forces a user to grant him access
to the system. SC was used to determine the person’s
overall arousal state (i.e., emotional state). A change
in the emotional state of a person results in different
keys. We discussed the reasons for adopting SC as an
emotional response parameter and why it was preferred
over other physiological signals such as electrocardiography,
electromyography, heart rate, respiration, and skin
temperature. In this paper, we have chosen skin conductance
along with voice for generating cryptographic
keys; however, one can choose any other biometric,
e.g., iris, fingerprint, or face, in lieu of voice. The cryptographic
key is generated using the lookup table method
discussed in [24].

To our knowledge, the presented work is the first to
fight coercion attacks in generating cryptographic
keys. We conducted two experiments in our user study
and have shown some interesting results. The proposed
model was tested with the voice and skin conductance
data of 39 users to compute the false-positive and false-negative
rates. Furthermore, our results showed that the
cryptographic keys generated in the two different scenarios
are different for the same person. This supports our
heuristic of using skin conductance to fight coercion
attacks. As neither skin conductance nor voice is a
static biometric, in some cases we obtained high
false-negative rates. We evaluated the security of the proposed
model in terms of entropy and several threat models,
and discussed how difficult it is for an attacker to succeed
when she has full information about the key generation
module, the skin conductance of the victim in the
stressful scenario, and the group information about
skin conductance.

Note that guessing entropy and guessing distance [5]
might provide deeper insight into the security of our model.
We leave this as future work. In terms of feasibility,
we would also like to explore the possibility of
building a system (perhaps a mobile device) with all
three of voice, skin conductance, and fingerprint extraction
mechanisms to authenticate to the system. Furthermore,
we would like to look into other emotional states, such as
happiness, joy, anger, and sadness, to strengthen the claim of using
SC in fighting coercion attacks. This paper does
not study the repeatability of the key generated using the proposed
scheme; this is left as future work.
A Guessing Entropy for Skin Conductance


Let $R$ be the set of rules the attacker can use to reduce
the password space from $S$ to $S'$. So, for a rule $R_I$,

antecedent($A$) $\Rightarrow$ consequent($C$)

such that $A = [y_1, \ldots, y_{E_a}]$ and $C = [z_1, \ldots, z_{E_c}]$, where
$E_a$ is the number of elements in the antecedent and $E_c$ in the
consequent. The process of calculating the new password
space from a given one is shown in Algorithm 2. $S'$ indicates
a lower bound for the password space, i.e.,
the minimum number of combinations an attacker needs
to guess if he has full knowledge of the mappings in the
database.

Let $\Psi$ denote the candidate set and $\Phi$ be the large
itemset; $\Psi^I$ and $\Phi^I$ are the two-dimensional vectors
derived from the rules $R_1, \ldots, R_I$. Each item ($\psi^I_J$)
in $\Psi^I$ is a vector of the form $[x_1, x_2, \ldots, x_{m_{SC}}]$, $\forall\, 0 < J \le L$,
where $x_i \in (0, 1, *)$ and $L = |\Psi^I|$. Similarly,
each item ($\phi^I_J$) in $\Phi^I$ is also a vector of the form
$[x_1, x_2, \ldots, x_{m_{SC}}]$, $\forall\, 0 < J \le L$, where $x_i \in (0, 1, *)$
and $L = |\Phi^I|$.
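Algorithm 2 Reduced Password Space for SC
PasswdSpace($R$)
  $\Phi^0 \leftarrow [*, *, \ldots, *]$
  $S \leftarrow 2^{m_{SC}}$
  for $I = 1$ to $|R|$ do
    $L \leftarrow$ length($\Phi^{I-1}$)
    $\Psi^I \leftarrow$ NULL
    for $J = 1$ to $L$ do
      if any($\phi^{I-1}_J(y_1, \ldots, y_{E_a})$ == *) then
        $\Psi^I \leftarrow \Psi^I \cup$ split($\phi^{I-1}_J$)
      else
        $\Psi^I \leftarrow \Psi^I \cup \phi^{I-1}_J$
      end if
    end for
    $\Psi^I \leftarrow$ unique($\Psi^I$)
    cnt $\leftarrow 1$
    $L \leftarrow$ length($\Psi^I$)
    for $J = 1$ to $L$ do
      if $\psi^I_J$ does not comply with rule $R_I$ then
        delete($\psi^I_J$)
      else
        $\Phi^I_{cnt} \leftarrow \psi^I_J$
        cnt++
      end if
    end for
  end for
  $S' \leftarrow$ number of vectors that can be generated from $\Phi^{|R|}$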

* denotes “don’t care” and can be assigned 0 or 1. The
set of rules $R$ obtained is passed to Algorithm 2 to
generate $S'$. $\Phi^0$ is initialized to $[*, *, \ldots, *]$ and
$S = 2^{m_{SC}}$. Below is a short description of the functions
used in the algorithm.


• length($\Psi^I$) - gives the total number of vectors in the
candidate set $\Psi^I$, i.e., $|\Psi^I|$.

• any($\phi^I_J(y_1, y_2, \ldots, y_{E_a})$ == *) - a boolean function
that is 1 if $(\phi^I_J(y_1) == *) \vee \cdots \vee (\phi^I_J(y_{E_a}) == *)$,
and 0 otherwise.

• split($\phi^{I-1}_J$) - this function generates a new candidate
set $\Psi^I$ from a large itemset $\Phi^{I-1}$ based on a rule
$R_I$. It generates the vectors for $\Psi^I$ as follows:

  – Mark $y_i$ if $(\phi^{I-1}_J(y_i) == *)$, $\forall y_i \in [y_1, \ldots, y_{E_a}]$.

  – Generate all possible combinations of the
  marked bits; if the total number of
  marked bits is $mb$, then the total number of possible
  combinations is $2^{mb}$. For example, if $\phi^{I-1}_J = [* * * 1 * 1]$
  and the rule $R_I$ is $1 \Rightarrow 2$, then the result is
  ([1 1 * 1 * 1] [1 0 * 1 * 1] [0 1 * 1 * 1] [0 0 * 1 * 1]).

• unique($\Psi^I$) - gives the unique vectors from $\Psi^I$.

• delete($\psi^I_J$) - deletes $\psi^I_J$ from the candidate set $\Psi^I$.


During candidate itemset generation, a * in the
large itemset triggers a split; 1 and 0 indicate “do nothing”.
During large itemset generation, however, a 1 in
a candidate itemset triggers “add 1”; 0 indicates “do nothing”.
Throughout the procedure, one rule is
used at a time, and the sets that do not comply with that rule
are omitted to create the new set. The final password
space is calculated by computing the total number of vectors
that can be generated using $\Phi^{|R|}$, where $\Phi^{|R|}$ is the
final large itemset generated from the rules $R_1, \ldots, R_{|R|}$.
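For small key lengths, the size of the reduced space can be cross-checked by brute force: enumerate every change vector and keep those that comply with all rules. This is a sanity check under stated assumptions, not Algorithm 2 itself; a rule is taken to mean "if all antecedent indices change, all consequent indices change", and the rules in the example are invented for illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple

Rule = Tuple[Sequence[int], Sequence[int]]  # (antecedent, consequent), 1-based


def complies(vector: Sequence[int], rules: List[Rule]) -> bool:
    """A vector complies if no rule has its antecedent set but its consequent not."""
    return all(
        not all(vector[i - 1] == 1 for i in ante)
        or all(vector[j - 1] == 1 for j in cons)
        for ante, cons in rules
    )


def reduced_space(m: int, rules: List[Rule]) -> int:
    """Count the length-m change vectors that comply with every rule."""
    return sum(1 for v in product((0, 1), repeat=m) if complies(v, rules))


# Hypothetical rules over 5 elements.
one_rule = reduced_space(5, [([1], [3])])
two_rules = reduced_space(5, [([1], [3]), ([3], [4])])
```

With the single rule 1 ⇒ 3 the count is 24 out of 32, which is what the itemsets [1 * 1 * *] and [0 * * * *] generate together (8 + 16). The enumeration is exponential in the key length, so it is only practical for short keys.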

An example with 5 elements is shown in Table 4 to illustrate
how to generate the candidate itemset and the large itemset
from 3 rules. The total number of guesses that an
attacker needs to make is 14, which implies that the effective
number of bits in the new password space is 4; the original
was 5.
R Candidate Large 
Itemset Itemset 





Prise [Tia yas 


1 * * * x 1*1x* x 
O*« * * x | O *« * * x 


| Riss | 





Table 4: Generating candidate set and large itemset 


Notes 


¹ We use a physiological data acquisition device called Lightstone
from WildDivine [38].